1 Introduction
In recent years, advances in neural data acquisition have made it possible to record the simultaneous sequential activity of up to thousands of neurons (Paninski & Cunningham, 2017). The analysis of these datasets often focuses on dimensionality reduction techniques that encode the activity of the population in a lower-dimensional latent trajectory (Cunningham & Yu, 2014). At the other extreme, there is a large body of detailed electrophysiological data coming from voltage measurements in single cells (Jones et al., 2009). In this setting it is understood that the underlying dynamics is in fact highly nonlinear and multidimensional, though the experimenter only has access to a one-dimensional (1D) observation. From such 1D recordings, the task is to approximately recover the complete hidden latent space paths and dynamics. A host of sophisticated techniques has been proposed for the analysis of complex sequential data that is not well described by linear transitions and observations (Archer et al., 2015; Chung et al., 2015; Gao et al., 2015, 2016; Hernandez et al., 2017; Johnson et al., 2016; Krishnan et al., 2015, 2016; Linderman et al., 2017; Pandarinath et al., 2018; Sussillo et al., 2016; Zhao & Park, 2017; Wu et al., 2017, 2018). In that context, we here present Variational Inference for Nonlinear Dynamics (VIND). The main contribution of VIND is an algorithm that allows variational inference (VI) with structured, intractable approximations to the posterior distribution. In particular, VIND can handle variational posteriors that (i) represent nonlinear evolution in the latent space, and (ii) disentangle the latent dynamics (transition) from the data encoding (recognition).
VIND relies on two key ideas. First, it makes use of the fact that given an intractable posterior approximation q, it is always possible to compute a tractable Gaussian approximation to it. This Gaussian approximation inherits its parameters from q (Chung et al., 2015), so optimizing it can be interpreted as indirectly optimizing q. The second novel aspect of VIND is the use of the fixed-point iteration (FPI) method to significantly speed up the computation of the aforementioned Gaussian approximation.
In this work we focus on a VIND variant in which the latent dynamics is represented as a Locally Linear Dynamical System (LLDS). The running time of LLDS/VIND is linear in the number of time points in a trial. We are especially interested in determining LLDS/VIND’s ability to infer the hidden dynamics, as demonstrated by its generative and predictive capabilities. After training, can the hidden dynamics uncovered by VIND generate data that is indistinguishable from the original data, when provided with a suitable starting point? In the second half of this work we apply VIND to four datasets, one synthetic and three experimental, and show that VIND excels in this task, in some cases outperforming established methods by orders of magnitude in the predicted mean squared error (MSE).
2 Variational Inference with Nonlinear Dynamics (VIND)
For a set of temporally ordered, correlated, noisy observations X = {x_1, …, x_T}, a latent variable model proposes an additional, time-ordered set of random variables Z = {z_1, …, z_T} that is hidden from view. The hidden state is endowed with a stochastic dynamics by which it evolves. The observations are generated by drawing samples from a z-dependent probability distribution.
A naive objective for such a model is the marginal log-likelihood log p_θ(X), with the latent variables integrated out of the joint. However, it is well known that for anything other than the simplest distributions, marginalization with respect to Z is intractable (Bishop, 2006). VI overcomes this problem by approximating the posterior p_θ(Z|X) with a distribution q_φ(Z|X), the Recognition Model (RM), from a tractable class. The objective becomes the celebrated ELBO, a lower bound to log p_θ(X) (Jordan et al., 1999):
log p_θ(X) ≥ E_{q_φ(Z|X)}[ log p_θ(X, Z) − log q_φ(Z|X) ]   (1)
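The bound in Eq. (1) can be checked numerically on a toy conjugate model. The model, numbers, and function names below are assumptions for illustration only, not the paper's setup: prior z ~ N(0,1), likelihood x|z ~ N(z,1), so the exact evidence is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_elbo(x, m, s, n=100000):
    """Monte-Carlo estimate of the ELBO for the toy model
    p(z) = N(0,1), p(x|z) = N(z,1), with recognition model q(z) = N(m, s^2)."""
    z = m + s * rng.normal(size=n)                           # reparameterized draws from q
    log_joint = -0.5 * (z**2 + (x - z)**2) - np.log(2 * np.pi)
    log_q = -0.5 * ((z - m) / s) ** 2 - np.log(s * np.sqrt(2 * np.pi))
    return np.mean(log_joint - log_q)

x = 1.3
# Exact evidence: marginally x ~ N(0, 2)
log_px = -0.25 * x**2 - 0.5 * np.log(4 * np.pi)
elbo_opt = mc_elbo(x, m=x / 2, s=np.sqrt(0.5))   # q equal to the true posterior
elbo_bad = mc_elbo(x, m=0.0, s=1.0)              # a mismatched q
```

When q equals the true posterior the ELBO is tight (equal to log p(x)); any other q gives a strictly lower value, which is what makes the ELBO a usable surrogate objective.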
Successful VI relies on the choice of the approximation q_φ. This choice is constrained by two goals that stand in tension: expressiveness and tractability. We approach the search for a suitable class of variational posteriors by considering the joint p_θ(X, Z). Our focus is on factorizations of the form:
p_θ(X, Z) = (1/η) ∏_{t=1…T} g_θ(x_t | z_t) · Ω_θ(Z)   (2)
where the distribution parameters θ have been written explicitly. The unnormalized densities g_θ(x_t | z_t) stand for an observation model that, for the purposes of this work, can be either Gaussian, N(x_t; μ_θ(z_t), Σ_x), or Poisson, Poisson(x_t; λ_θ(z_t)). The respective mean, μ_θ(z_t), and rate, λ_θ(z_t), are arbitrary nonlinear functions of the latent state z_t that we represent as neural networks. The standard deviation Σ_x of the Gaussian observation model is taken to be state-independent. η is a normalization constant. Ω_θ(Z) is the latent evolution term in z-space with a Markov-chain structure:
Ω_θ(Z) = p(z_1) ∏_{t=1…T−1} p(z_{t+1} | z_t)   (3)
p(z_1) = N(z_1; μ_0, Q_0)   (4)
p(z_{t+1} | z_t) = N(z_{t+1}; f_θ(z_t), Q)   (5)
where f_θ is an arbitrary nonlinearity.
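The generative structure just described can be sketched in a few lines of numpy. The specific maps below (a one-layer tanh transition and a linear observation mean) are hypothetical stand-ins for the learned networks, chosen only to make the sampling loop concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z, W):
    # Stand-in nonlinear evolution map f(z); the paper uses a learned network.
    return np.tanh(W @ z)

def sample_trial(T, d_z, d_x, noise_z=0.1, noise_x=0.05):
    """Draw one trial from the generative model:
    z_{t+1} = f(z_t) + eps_t,   x_t ~ N(mu(z_t), sigma_x^2 I)."""
    W = rng.normal(size=(d_z, d_z)) / np.sqrt(d_z)   # transition weights (assumed)
    C = rng.normal(size=(d_x, d_z))                  # observation mean mu(z) = C z (a simplification)
    z = np.zeros((T, d_z))
    x = np.zeros((T, d_x))
    z[0] = rng.normal(size=d_z)
    for t in range(T - 1):
        z[t + 1] = f(z[t], W) + noise_z * rng.normal(size=d_z)
    for t in range(T):
        x[t] = C @ z[t] + noise_x * rng.normal(size=d_x)
    return z, x

z, x = sample_trial(T=50, d_z=3, d_x=10)
```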
From Eq. (2), the posterior distribution of the Generative Model (GM) can be factorized as
p_θ(Z | X) ∝ ∏_{t=1…T} g_θ(x_t | z_t) · Ω_θ(Z)   (6)
Guided by Eq. (6), we propose to include the GM evolution term Ω_θ into the variational posterior. We start then with an approximation that factorizes as:
q_φ(Z | X) = (1/ζ) ∏_{t=1…T} h_φ(z_t | x_t) · Ω_θ(Z)   (7)
We refer to q_φ as the parent distribution. By design, the factor h_φ in the parent contains all the dependence on the data X. For definiteness, the case
h_φ(z_t | x_t) = N(z_t; μ_φ(x_t), Σ_φ(x_t))   (8)
is considered in this work, where μ_φ and Σ_φ are nonlinear maps.
In Eq. (7), ζ is a normalization constant that, regardless of the specific form of h_φ, cannot be computed in closed form. This is most easily seen by noticing that, due to the nonlinearity f_θ, integration with respect to Z yields an intractable X-dependent factor. Hence, VI cannot be formulated directly in terms of the parent, since expectation values with respect to q_φ cannot be computed in closed form.
VIND represents a way out of this conundrum that, effectively, allows for the use of an intractable, unnormalized distribution as the Recognition Model in VI. The idea is to define a Gaussian approximation to the parent; this child distribution is then used as the variational posterior in Eq. (1). The inference problem becomes tractable since the child is normal. Importantly, the parameters in the child, with respect to which we optimize, are inherited from the parent. After training, they can be replaced back into the parent, obtaining, in particular, the nonlinear dynamics for the latent space. In this novel way, VIND achieves the reuse of the latent dynamics of the GM in the RM.
Concretely, let q̃_φ(Z | X) be a Laplace approximation to q_φ(Z | X),
q̃_φ(Z | X) = N(Z; μ̂, Λ̂⁻¹)   (9)
The mean μ̂ in Eq. (9) is the solution to the equation
∂ log q_φ(Z | X) / ∂Z |_{Z=μ̂} = 0   (10)
and the precision Λ̂ is given by
Λ̂ = − ∂² log q_φ(Z | X) / ∂Z ∂Zᵀ |_{Z=μ̂}   (11)
where Eq. (11) defines Λ̂. A closed-form solution of Eq. (10) is not possible in general. However, for a large class of distributions, and in particular for any log q_φ that includes terms quadratic in Z, it is possible to rewrite Eq. (10) in the fixed-point form
μ̂ = T_φ(μ̂)   (12)
In this form, the latter can be approximately solved by making use of the FPI method. This approximate solution provides the mean for the variational approximation q̃_φ.
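The Laplace-plus-FPI recipe can be illustrated in one dimension. The log-density below is an assumption chosen so that the stationarity condition takes an explicit fixed-point form; the mode gives the Gaussian mean and the negative second derivative at the mode gives the precision:

```python
import numpy as np

def fixed_point_iteration(g, z0, max_iters=200, tol=1e-12):
    """Solve z = g(z) by iterating z <- g(z); converges when g is a
    contraction near the solution (assumed here)."""
    z = z0
    for _ in range(max_iters):
        z_new = g(z)
        if abs(z_new - z) < tol:
            break
        z = z_new
    return z_new

# Toy unnormalized log-density with a quadratic term plus a nonlinearity:
#   log q(z) = -z^2/2 + sin(z)      (an illustration, not the paper's model)
# Setting d/dz log q = -z + cos(z) = 0 gives the fixed-point form z = cos(z).
mode = fixed_point_iteration(np.cos, 0.0)
precision = 1.0 + np.sin(mode)   # -(d^2/dz^2) log q evaluated at the mode
```

The resulting Gaussian N(mode, 1/precision) plays the role of the child distribution for this toy density.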
Accordingly, VIND’s algorithm includes two steps per epoch that are carried out in alternation (see Algorithm 1 in App. A). The first step is an FPI that, for the current values of the parameters, determines the mean and variance of a Laplace approximation to the parent. The second is a regular gradient-ascent update with respect to the ELBO objective. Upon convergence, the set of parameters that maximizes the ELBO can be plugged back into the parent to obtain a dynamical rule that interpolates between the different latent trajectories inferred from the data trials.
In the experiments conducted in this paper, the nonlinear dynamics is specified as f_θ(z) = A_θ(z) z, where A_θ(z) is a state-space-dependent matrix. We call this evolution rule a Locally Linear Dynamical System (LLDS), and the resulting inference algorithm LLDS/VIND. The detailed derivation of the equations of LLDS/VIND is developed in App. A.
3 Relation to Previous Work
The problem of inference for sequential data has been treated extensively in the literature. The GfLDS and PfLDS models introduced in Archer et al. (2015); Gao et al. (2016) are particular cases of VIND in which the dynamics in the latent space is linear and time-invariant, i.e. f(z) = A z for a fixed matrix A. In the jargon used in this paper, this corresponds to the situation in which the parent distribution is Gaussian, and therefore equal to its own Laplace approximation. Eq. (10) can be solved analytically in this case and no FPI step is needed. Gaussian Process Factor Analysis (GPFA) (Yu et al., 2009) assumes linear, time-invariant dynamics as well as a linear observation model, i.e. x_t = C z_t + d + ε_t, for some matrix C, offset d, and noise ε_t. We will be explicitly comparing our results to results obtained by these models.
In Krishnan et al. (2015), Deep Kalman Filters (DKF) were proposed to handle variational posterior distributions that describe nonlinear evolution in the latent space. Their approximate posterior, analogous to the parent distribution in this paper, is plugged directly into the ELBO. This imposes some restrictions on the form the posterior can take; for instance, it must be Gaussian conditioned on the observations. VIND can handle factorizations of the parent distribution that are not restricted in this way, an example being LLDS/VIND, which has the form in Eq. (7). VIND’s ability to handle unnormalizable parent distributions is due to the fact that VIND’s actual approximate posterior is always strictly normal. The same authors built upon their idea in Krishnan et al. (2016), where a variational posterior was proposed that partially uses the conditional structure implied by the generative model. In this paper, a similar prescription is used by assuming that the GM and the RM share exactly the same factorization for the latent evolution. The authors of Johnson et al. (2016) combine probabilistic graphical models with message passing in an approach based on conjugate priors. The approximate posterior distributions considered in that work are restricted by the conjugacy requirements; in particular, the evolution term must belong to the exponential family. VIND’s parent distribution is not subject to this requirement. However, since VIND’s actual approximate posterior is still Gaussian, it may be possible to combine the two methods into one that can handle both nonlinear evolution and discrete latent variables.
In Chung et al. (2015), Gaussian noise is added to the deterministic evolution rule of an RNN in the context of a variational autoencoder, termed VRNN. Similarly to LLDS/VIND, these authors share the evolution factorization between the generative model and the approximate posterior and, indeed, the only difference between the structure of their model and that of LLDS/VIND is that the evolution there is expressed as an RNN instead of as an LLDS. However, their inference algorithm only uses past data to estimate the hidden state at any given time. VIND’s algorithm, based on the FPI, uses information both from the past and from the future to estimate the latent paths. Finally, in Pandarinath et al. (2018); Sussillo et al. (2016), a sophisticated, bidirectional, Deep-Learning-based RNN architecture called LFADS was proposed with neuroscience applications in mind. Its transition functions are deterministic, which may be a disadvantage in systems in which stochasticity plays an important role, such as the single-cell data of Sec. 4.3. For both LFADS and DKF, we found it difficult to modify their code to compute the quantities that are used in this paper to evaluate the quality of training. However, given the expressive power of these works, we expect them to perform comparably to VIND in the tasks considered in the next section.
4 Results
We demonstrate the capabilities of LLDS/VIND by applying it to four characteristic datasets, each of which illustrates a VIND feature. First, we use synthetically generated 10D data with added Gaussian noise, and latent evolution determined by an Euler discretization of the Lorenz system. This dataset is the simplest and cleanly illustrates VIND’s ability to infer the underlying nonlinear dynamics. Second, we use a multi-electrode neural recording from a mouse performing a delayed-discrimination task. LLDS/VIND is run with both Gaussian and Poisson observation models. It is found that while a Gaussian observation model is superior for explaining the variance in the data, the Poisson model performs better when it comes to interpolation of the dynamics. This is a common VIND trade-off. The third dataset consists of a 1D voltage measurement from single-cell recordings. The problem in this case is not dimensionality reduction but rather to determine the underlying dynamics. We find that the minimum number of latent dimensions VIND requires to describe this dataset coincides with the naive expectation. Finally, we apply VIND to discovering dynamics in a more complex dataset coming from dorsal cortex calcium imaging, and find that it is possible to model the data using a surprisingly low number of latent dimensions. We also use VIND to reconstruct dynamics from one side of the brain to the other.
Given an inferred starting point in state space, the quality of the dynamics uncovered by LLDS/VIND can be ascertained by evolving the system k steps into the future without any input data. To clarify terminology, this is not strict prediction in the sense of pure extrapolation, since we use information about all observations, both in the past and in the future, to infer the starting point. To avoid doubt, we use the term forward interpolation. Forward interpolation essentially tests the extent to which the dynamics are accurately learned. We take VIND’s capability for forward interpolation as the main measure of the fit’s success. As we will show, this task remains highly challenging for simpler smoothing priors like the latent LDS, and it is one of the key strengths of VIND.
To make this analysis quantitative, we compute the k-step mean squared error (MSE_k) on test data, and its normalized version, the R²_k, defined as
R²_k = 1 − [ Σ_t (x_t − x̂_t^(k))² ] / [ Σ_t (x_t − x̄)² ]   (13)
where x̄ is the data average for this trial and x̂_t^(k) is the prediction at time t. The latter is obtained by (i) using the full data to obtain the best estimate for the latent state at time t − k, (ii) applying k times the LLDS/VIND evolution equation z_{t+1} = A_θ(z_t) z_t, or z_{t+1} = A z_t for the LDSs, to find the latent state k time steps in the future, and (iii) using the generative network to compute the forward-interpolated observation. Note that, in particular, R²_0 corresponds to the standard R². The more general R²_k ensures that VIND yields more than just a good autoencoder. We will be comparing results obtained with LLDS/VIND to several models, namely, GfLDS, PfLDS, and GPFA (see Sec. 3 for details).
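The normalized k-step error described above reduces to a short computation once the k-step forward interpolations are in hand. A minimal numpy sketch (array shapes are assumptions):

```python
import numpy as np

def k_step_r2(x_true, x_pred):
    """Normalized k-step error: R^2_k = 1 - MSE_k / MSE(trial-mean predictor).
    x_true: (T, d) observations; x_pred: (T, d) k-step forward interpolations."""
    mse_k = np.sum((x_true - x_pred) ** 2)
    mse_mean = np.sum((x_true - x_true.mean(axis=0)) ** 2)
    return 1.0 - mse_k / mse_mean

x = np.sin(np.linspace(0.0, 6.0, 100))[:, None]    # a toy 1D trial
r2_perfect = k_step_r2(x, x)                       # exact forward interpolation
r2_mean = k_step_r2(x, np.full_like(x, x.mean()))  # trial-mean predictor
```

Perfect interpolation gives R²_k = 1, the trial-mean predictor gives 0, and a model worse than the trial mean goes negative, which is why large k is a demanding test of the learned dynamics.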
4.1 Lorenz system
The Lorenz system is a classical nonlinear differential equation in 3 independent variables:
ẋ = σ(y − x),  ẏ = x(ρ − z) − y,  ż = xy − βz   (14)
This is a well-studied system with chaotic solutions that serves to cleanly demonstrate VIND’s capabilities for inferring nonlinear dynamics. We generated numerical solutions of the Lorenz system from randomly generated initial conditions, for fixed values of σ, ρ, and β, with additive Gaussian noise. Gaussian observations in a 10D space were then generated with the mean specified by a z-dependent neural network. The complete synthetic data consisted of 100 trials, each comprising 250 timesteps, of which 80% was used for training and the remainder for validation.
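The latent-path generation can be sketched as follows. The classical parameter values (σ = 10, ρ = 28, β = 8/3) and the step size are assumptions here, since the exact values used for the dataset are not recoverable from the text:

```python
import numpy as np

def lorenz_euler(z0, T, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0,
                 noise=0.0, rng=None):
    """Euler-discretized Lorenz system with optional additive Gaussian noise:
    z_{t+1} = z_t + dt * F(z_t) + noise * eps_t."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = np.zeros((T, 3))
    z[0] = z0
    for t in range(T - 1):
        x, y, w = z[t]
        dz = np.array([sigma * (y - x), x * (rho - w) - y, x * y - beta * w])
        z[t + 1] = z[t] + dt * dz + noise * rng.normal(size=3)
    return z

traj = lorenz_euler(np.array([1.0, 1.0, 1.0]), T=250)
```

Each such latent trajectory would then be pushed through an observation network to produce the 10D Gaussian observations.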
The results of the fit to this data are shown in Fig. 1. The left panel shows the R²_k comparison for the VIND and GfLDS fits. Strikingly, for this dataset, VIND’s performance is 2-3 orders of magnitude better than GfLDS. The right panel illustrates VIND’s capability to infer properties of the underlying dynamics: VIND hits peak performance at d_z = 3, the true dimensionality of this system. In the rightmost panel, all the paths inferred by VIND have been put together, showing the famous butterfly pattern.
4.2 Electrophysiology
We analyzed data collected from mice performing a delayed-discrimination task in a simultaneous recording session, using multi-unit electrophysiology (Janelia silicon probe) (Guo et al., 2014; Li et al., 2015). In this task, the animals were trained to discriminate the location of a pole using their whiskers. The pole was presented during the sample epoch, and an auditory go cue signaled the beginning of the response epoch. During the response epoch, the mice reported the perceived pole position by licking one of two lick ports. Neurons in this task exhibit complex dynamics across behavioral epochs; some neurons show ramping and persistent activity from sample to delay, which relates to the preparation of the choice at response (Guo et al., 2014; Li et al., 2015; Wei et al., 2018), while other neurons show peaking activity aligned to the behavioral epochs; see Fig. 2 (left).
We asked whether VIND can capture the variety of neural dynamics using a few latent dimensions. The data was fitted using a Poisson observation model. The fit not only reproduces the neural observations, but also provides insights into the dynamics in the latent space. Details can be found in Fig. 6 in App. B. Subsequently, a cross-validation procedure was used to assess the performance of fits using VIND’s Gaussian and Poisson observation models across a range of latent dimensionalities, regardless of trial type. The R² was computed to determine the performance of VIND as compared to other models. For VIND, both Poisson and Gaussian observation models were used. These are compared to a Peristimulus Time Histogram (PSTH), a GPFA model (Yu et al., 2009), as well as GfLDS and PfLDS (Archer et al., 2015; Gao et al., 2016). The results are shown in the center panel in Fig. 2. We found that nonlinear Gaussian VIND performs best regarding explained variance of the data.
Next we analyzed the forward-interpolation capabilities of the fitted models on the neural data using R²_k. In this case the Poisson observation model gives a substantially better forward interpolation, signaling a dynamical system that more accurately represents the data evolution. This can be seen in the right panel in Fig. 6. These two results combined exemplify the VIND trade-off between explained variance and forward-interpolation capabilities. Using Poisson observations, VIND is less able to fit the higher-frequency components of the data. The resulting dynamical system, however, is smoother and more appropriately captures the evolution of the system; see App. B.
4.3 Single Cell Voltage Data
We demonstrate VIND’s versatility in uncovering the underlying dynamics of a system by applying it to 1D voltage electrophysiology data recorded from single cells. This is not a dimensionality reduction problem but rather one of recovering the latent phase space from a single variable, to identify the ‘true dimensionality’ of the system under study. The data is the publicly available Allen Brain Atlas dataset (Jones et al., 2009).
Intracellular voltage recordings from cells in the mouse Primary Visual Cortex were selected. Trials with no spikes were removed, resulting in a set of trials from different cells. The input for each of the remaining trials consists of a step-function current (amplitude in the pA range). Observations were split into training and validation sets. The data was then downsampled (from a 50 kHz sampling rate) in equal-time intervals, and subsequently normalized by dividing each trial by its maximal value.
LLDS/VIND was fit to this data for a range of latent dimensionalities, repeated across runs. The top three fits were averaged and the results are summarized in Fig. 3. The center panel displays the R²_k values for each choice of latent dimensionality. The fits consistently improve up to d_z = 5, after which there are diminishing returns. Single-cell voltage data has traditionally been modeled using variants of the classical Hodgkin-Huxley neuron model (Hodgkin & Huxley, 1952), a set of nonlinear differential equations in 4 independent variables, plus an optional independent input current. It is interesting that 5 is exactly the minimal number of latent dimensions that provides a good VIND fit for this data. The right panel displays R²_k as a function of k for VIND and for GfLDS. VIND outperforms GfLDS by an order of magnitude.
We show the forward-interpolated observations and sample paths for selected runs using VIND and GfLDS in Fig. 4. The left panel shows the forward-interpolated observations over a rolling window, several timepoints in advance, for both VIND and GfLDS. The dynamics inferred by GfLDS are unable to capture the nonlinearities of both the hyperpolarization and depolarization epochs. The latent trajectories for this data are plotted in the center panel. The dimensions inferred by VIND exhibit behavior similar to that of Hodgkin-Huxley gating variables. As shown in the right panel, we find that in state space, spikes are represented by large cycles (red), while inter-spike fluctuations correspond to separate regions of phase space (blue). For each trial, the height of the voltage spike is coded in the diameter of the cycles.
4.4 Spontaneous Activity in Widefield Imaging Data
The unsupervised modeling of spontaneous brain activity is inherently challenging due to the lack of task structure. Here, we model the temporal dynamics of widefield optical mapping (WFOM) data and simultaneous behavior recorded from an awake, head-fixed mouse during spontaneous activity (Ma et al., 2016a). This data was recorded and corrected for hemodynamics in the Laboratory for Functional Optical Imaging at Columbia University. The recording and preprocessing details for both cortical dynamics and the movement-speed signal are provided in App. C. An example frame of the data is shown in Fig. 5 (top-left). The preprocessing of the WFOM cortical data leads to reduced-dimension, denoised cortical activity (as detailed in App. C). The temporal activity of the cortex and the movement speed are jointly modeled using both GfLDS and VIND, with the results for validation data on one mouse shown in Fig. 5. The k-step forward interpolation is shown in Fig. 5 (bottom-left), with varying k, for both VIND and GfLDS. R²_k values for the modeled dimensions on validation data are shown in Fig. 5 (center). VIND is seen to outperform GfLDS, capturing the fine-grained dynamics, thus also leading to better interpolations. We highlight VIND’s capability to roughly capture the dynamics of the whole superficial dorsal cortex using a low-dimensional latent vector and the corresponding evolution and generative networks.
Next, a VIND model was fit to the brain dynamics of only the left-hand side (LHS) of the brain, after similar preprocessing of the data. A separate neural network was then fit from the latents learned on the LHS to the temporal dynamics of the right-hand side (RHS) of the brain, with an MSE loss function. The goal was to infer dynamics from one half of the brain to the other. Example reconstructions of the temporal dynamics of the RHS on held-out data are shown in Fig. 5 (right) (variance-weighted average over the entire data). This shows that the latent variables learned by VIND on one half of the brain are useful to coarsely reconstruct the temporal dynamics of the other half.
5 Discussion
In this work we introduced VIND, a novel variational inference framework for nonlinear latent dynamics that is able to handle intractable distributions. We successfully implemented the method for the specific case of Locally Linear Dynamical Systems, which allows for a fast inference algorithm (linear in the number of time points). When applied to real data, VIND consistently outperforms other methods, in particular methods that rely on an approximate posterior representing linear dynamics. Furthermore, VIND’s fits yield insights about the dynamics of these systems. Highlights are the ability to identify the transition points and distinguish among trial types in the electrophysiology task, the dimensionality suggested by VIND’s fits for the single-cell voltage data, and the ability of the latents learned from one half of the brain to reconstruct activity from the other half in widefield imaging data. Moreover, VIND can be naturally extended to handle labelled data and data with inputs. This is work in progress.
LLDS/VIND is written in TensorFlow and the source code is publicly available at www.github.com/dhernandd/vind.
6 Acknowledgements
We thank Scott Linderman and Jonathan Pillow for comments on the paper, and Nuo Li for support for the electrophysiological recordings. D.H. is funded by NIH R01NS100066 and Simons Foundation 542963. Z.W. is funded by the Howard Hughes Medical Institute and the Simons Foundation Collaboration on the Global Brain SCGB 542943SPI. S.S. is funded by the Swiss National Science Foundation (Research Award P2SKP2 178197). The Theory Center is funded by NSF NeuroNex DBI1707398 and by the Gatsby Charitable Foundation.
References
 Archer et al. (2015) Evan Archer, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. Black box variational inference for state space models, 2015.
 Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). SpringerVerlag, Berlin, Heidelberg, 2006. ISBN 0387310738.
 Buchanan et al. (2018) K. E. Buchanan, J. Friedrich, Ian Kinsella, Patrick Stinson, Pengcheng Zhou, Felipe Gerhard, John Ferrante, Graham Dempsey, and Liam Paninski. Constrained matrix factorization methods for denoising and demixing voltage imaging data. In Cosyne, 2018.
 Chung et al. (2015) Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. CoRR, abs/1506.02216, 2015.
 Cunningham & Yu (2014) John P Cunningham and Byron M Yu. Dimensionality reduction for large-scale neural recordings. Nature Neuroscience, 17:1500, 2014.
 Gao et al. (2015) Yuanjun Gao, Lars Buesing, Krishna V Shenoy, and John P Cunningham. High-dimensional neural spike train analysis with generalized count linear dynamical systems. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 2044–2052. Curran Associates, Inc., 2015.
 Gao et al. (2016) Yuanjun Gao, Evan Archer, Liam Paninski, and John P. Cunningham. Linear dynamical neural population models through nonlinear embedding. NIPS 2016, 2016.
 Guo et al. (2014) Zengcai V. Guo, Nuo Li, Daniel Huber, Eran Ophir, Diego Gutnisky, Jonathan T. Ting, Guoping Feng, and Karel Svoboda. Flow of cortical activity underlying a tactile decision in mice. Neuron, 81(1):179–194, 2014. doi: 10.1016/j.neuron.2013.10.020.
 Hernandez et al. (2017) Daniel Hernandez, Liam Paninski, and John Cunningham. Variational inference for nonlinear dynamics. TSW, NIPS 2017, 2017.
 Hodgkin & Huxley (1952) Alan L Hodgkin and Andrew F Huxley. A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of physiology, 117(4):500–544, 1952.

 Jimenez Rezende et al. (2014) D. Jimenez Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
 Johnson et al. (2016) Matthew J. Johnson, David Duvenaud, Alexander B. Wiltschko, Sandeep R. Datta, and Ryan P. Adams. Composing graphical models with neural networks for structured representations and fast inference, 2016.
 Jones et al. (2009) Allan R. Jones, Caroline C. Overly, and Susan M. Sunkin. The Allen Brain Atlas: 5 years and beyond. Nature Reviews Neuroscience, 10:821, 2009.
 Jordan et al. (1999) Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, Nov 1999. ISSN 15730565. doi: 10.1023/A:1007665907178.
 Kingma & Welling (2013) D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv e-prints, December 2013.
 Krishnan et al. (2015) Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep kalman filters, 2015.
 Krishnan et al. (2016) Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models, 2016.
 Li et al. (2015) Nuo Li, Tsai-Wen Chen, Zengcai V. Guo, Charles R. Gerfen, and Karel Svoboda. A motor cortex circuit for motor planning and movement. Nature, 519:51, 2015.
 Linderman et al. (2017) Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, and Liam Paninski. Bayesian learning and inference in recurrent switching linear dynamical systems. In Artificial Intelligence and Statistics, pp. 914–922, 2017.
 Ma et al. (2016a) Ying Ma, Mohammed A Shaik, Sharon H Kim, Mariel G Kozberg, David N Thibodeaux, Hanzhi T Zhao, Hang Yu, and Elizabeth MC Hillman. Widefield optical mapping of neural activity and brain haemodynamics: considerations and novel approaches. Phil. Trans. R. Soc. B, 371(1705):20150360, 2016a.
 Ma et al. (2016b) Ying Ma, Mohammed A Shaik, Mariel G Kozberg, Sharon H Kim, Jacob P Portes, Dmitriy Timerman, and Elizabeth MC Hillman. Restingstate hemodynamics are spatiotemporally coupled to synchronized and symmetric neural activity in excitatory neurons. Proceedings of the National Academy of Sciences, 113(52):E8463–E8471, 2016b.
 Pandarinath et al. (2018) Chethan Pandarinath, Daniel J. O’Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D. Stavisky, Jonathan C. Kao, Eric M. Trautmann, Matthew T. Kaufman, Stephen I. Ryu, Leigh R. Hochberg, Jaimie M. Henderson, Krishna V. Shenoy, L. F. Abbott, and David Sussillo. Inferring single-trial neural population dynamics using sequential autoencoders. Nature Methods, 15(10):805–815, 2018. doi: 10.1038/s41592-018-0109-9.

 Paninski & Cunningham (2017) Liam Paninski and John Cunningham. Neural data science: accelerating the experiment-analysis-theory cycle in large-scale neuroscience. bioRxiv, pp. 196949, 2017.
 Sussillo et al. (2016) David Sussillo, Rafal Jozefowicz, L. F. Abbott, and Chethan Pandarinath. LFADS - latent factor analysis via dynamical systems, 2016.
 Wei et al. (2018) Ziqiang Wei, Hidehiko Inagaki, Nuo Li, Karel Svoboda, and Shaul Druckmann. An orderly single-trial organization of population dynamics in premotor cortex predicts behavioral variability. bioRxiv, pp. 376830, 2018.
 Wu et al. (2017) Anqi Wu, NG Roy, S Keeley, and JW Pillow. Gaussian process based nonlinear latent structure discovery in multivariate spike train data. NIPS 2017, 2017.
 Wu et al. (2018) Anqi Wu, S Pashkovski, R.S. Datta, and J.W. Pillow. Learning a latent manifold of odor representations from neural responses in piriform cortex. NIPS 2018, 2018.
 Yu et al. (2009) Byron M Yu, John P Cunningham, Gopal Santhanam, Stephen I Ryu, Krishna V Shenoy, and Maneesh Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology, 102(1):614–635, 2009. doi: 10.1152/jn.90941.2008.
 Zhao & Park (2017) Yuan Zhao and Il Memming Park. Variational recursive dual filtering, 2017.
Appendix A LLDS/VIND
In this appendix, we provide details of the VIND framework for a specific parameterization of the hidden dynamics.
A.1 Theory and Algorithm
Consider a parent distribution q_φ(Z | X), as defined in Eq. (7). The mean μ_φ and the standard deviation Σ_φ in Eq. (8) are represented as deep neural networks:
(15) 
The remaining ingredient of q_φ is the shared evolution law Ω_θ(Z), Eq. (3). We write the factors that determine the latent evolution model as
p(z_{t+1} | z_t) = N(z_{t+1}; A_θ(z_t) z_t, Q)   (16)
which can be thought of as a “locally linear” dynamical system:
z_{t+1} = A_θ(z_t) z_t + ε_t,  ε_t ∼ N(0, Q)   (17)
We use the term LLDS/VIND to refer to VIND with this specific parameterization of the nonlinearity in Eq. (5). LLDS/VIND has a couple of desirable features:

The limit of linear evolution is easily taken as α → 0.

α is a simple measure of the smoothness of the latent trajectories.
As mentioned in the main text, an equation for the Laplace mean is found by taking derivatives of log q_φ with respect to the hidden path Z:
∂ log q_φ(Z | X) / ∂Z = 0   (18)
Using Eqs. (3) and (8) we find
(19) 
where Σ(Z) is a state-space-dependent covariance with tridiagonal form:
(20) 
Inserting Eq. (19) into Eq. (18), we obtain the LLDS/VIND equation for the posterior mean:
(21) 
with
(22) 
We remark that the matrix is block-tridiagonal and can be inverted efficiently, in time linear in the number of time points. Although Eq. (21) cannot be solved analytically, a solution can be approximated by choosing an initial point and iterating
(23) 
The VIND method assumes that this FPI converges. In practice, convergence can be ensured throughout training by appropriate choices of hyperparameters and network architectures.
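The mechanics of this iterate-until-convergence scheme can be sketched with a generic contractive map standing in for the true VIND update (which pushes the whole path through the inverse block-tridiagonal covariance); the map `f` and all sizes below are illustrative.

```python
import numpy as np

# Toy stand-in for the fixed-point iteration (FPI) of Eq. (23): Z <- f(Z).
# The real VIND update involves the inverse block-tridiagonal covariance;
# here f is simply a contraction (Lipschitz constant 1/2), so convergence
# of the iteration is guaranteed.
def f(Z):
    return 0.5 * np.tanh(Z) + 0.1

Z = np.zeros((5, 2))  # initial path estimate: T=5 time points, d=2 (illustrative)
for _ in range(50):   # in practice a small, fixed number of iterations suffices
    Z_new = f(Z)
    if np.max(np.abs(Z_new - Z)) < 1e-12:
        break
    Z = Z_new

# At (numerical) convergence, Z is a fixed point: f(Z) ~ Z.
assert np.allclose(f(Z), Z, atol=1e-6)
```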
During training, the mean represents the best current estimate of the latent trajectory. Note that the FPI step in Eq. (23) mixes all the components of the path. In particular, the component of the path at a given time depends in general on all the other components, via the inverse covariance in Eq. (22). Thus, at every training epoch, the best estimate for the path at a specific time point contains information from the complete data, both to the past and to the future of that point. This is analogous to the Kalman smoother.
As detailed in Alg. 1, each training epoch consists of two steps carried out in alternating fashion: the FPI, which updates the best estimate of the latent path, and a gradient-descent step, which updates the model parameters. The ELBO is optimized with stochastic gradient descent. As is customary, the so-called “reparameterization trick” is used to estimate the gradients. Samples are extracted from the variational approximation
(24) 
where the noise variable is a standard normal (Kingma & Welling (2013); Jimenez Rezende et al. (2014)).
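A minimal sketch of the reparameterization in Eq. (24), assuming a Gaussian variational posterior with full covariance; the particular numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Sketch of the reparameterization trick of Eq. (24): a sample from N(mu, Sigma)
# is written as mu + L @ eps, with eps a standard normal and L the Cholesky
# factor of Sigma, so that gradients can flow through mu and Sigma.
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
L = np.linalg.cholesky(Sigma)

eps = rng.standard_normal((100_000, 2))  # standard-normal noise
samples = mu + eps @ L.T                 # reparameterized samples

# Empirical moments match the target distribution up to Monte Carlo error.
assert np.allclose(samples.mean(axis=0), mu, atol=0.02)
assert np.allclose(np.cov(samples.T), Sigma, atol=0.02)
```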
A.2 Implementation details
We provide here extra details of the LLDS/VIND implementation that was used throughout the paper:

The initial path estimates are taken to be .

VIND is initialization-sensitive. Empirically, we found that it is important that the initial path estimates fall within a region where the nonlinearity is not severe, for every trial.

To encourage smoothness of the latent dynamics, the transition was specified as
(25) where the first term is a state-space-independent linear transformation initialized to the identity, and the coefficient of the nonlinear term is a non-trainable hyperparameter of the model. This setup has the added benefit that, in the linear limit, it is equivalent, in both the statistical model and the algorithm, to GfLDS/PfLDS (Archer et al. (2015); Gao et al. (2016)).
The local transformation is redundant (akin to a gauge transformation, in physics parlance). To see this, note that for every latent state, the image of the transformation is a single vector, whereas the transformation matrix has many more degrees of freedom. In other words, given a state and its image, there is a continuum of matrices that map one to the other. It follows that the transformation can be substantially restricted without loss of generality (“fixing the gauge”). We found that the best results were obtained when it was constrained to be symmetric.
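A small numerical illustration of this redundancy, using hypothetical matrices (not the fitted model's):

```python
import numpy as np

# Hypothetical matrices illustrating the gauge redundancy: for a fixed latent
# state z, adding to B any matrix whose rows are orthogonal to z leaves the
# product B @ z unchanged.
z = np.array([1.0, 2.0, 0.0])
B1 = np.arange(9.0).reshape(3, 3)

n = np.array([2.0, -1.0, 0.0])       # chosen so that n @ z == 0
N = np.outer(np.ones(3), n)          # every row of N annihilates z
assert np.allclose(N @ z, 0.0)
assert np.allclose((B1 + N) @ z, B1 @ z)   # same action on z, different matrix

# One way to "fix the gauge": restrict B to symmetric matrices.
B_sym = 0.5 * (B1 + B1.T)
assert np.allclose(B_sym, B_sym.T)
```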

Experiments showed that a small, fixed number of FPIs in Eq. (23) is enough to produce good convergence results across datasets.

In all experiments we found no noticeable decrease in performance when the gradient terms in Eq. (22), and the corresponding ones in the evolution law, are neglected. These terms are subleading both because they are proportional to the nonlinearity, which is small as required by smoothness, and because the gradient is applied to a deep neural network.
Appendix B Details of the electrophysiology fit
The experimental details in Sec. 4.2 are as follows. The recording session contains 18 simultaneously recorded units, with 74 lick-right trials (in blue; posterior pole location) and 100 lick-left trials (in red; anterior pole location). Spike counts were binned in non-overlapping 67 ms time windows; the maximum spike count per bin is 10 and the minimum is 0. The fit covers almost the entire recording, from 0.5 sec before sample onset to 2.0 sec after the response, so that each trial contains 77 time bins.
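The binning step can be sketched as follows; the spike times and exact window edges below are made up for illustration, and only the 67 ms bin width and the 77-bin trial length come from the text.

```python
import numpy as np

# Illustrative spike-count binning: spike times (in seconds, relative to
# sample onset) for one unit are counted in non-overlapping 67 ms windows.
bin_width = 0.067
t_start = -0.5
edges = t_start + bin_width * np.arange(78)  # 78 edges -> 77 bins

spike_times = np.array([-0.40, -0.39, 0.10, 0.11, 0.12, 2.50])
counts, _ = np.histogram(spike_times, bins=edges)

assert counts.shape == (77,)
assert counts.sum() == 6  # all spikes fall inside the trial window
```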
Fig. 6A shows the average neuronal activity of 3 representative cells in the recordings. One cell is a typical neuron with small separation between trial types, but strong peaking activity at the transition from the delay to the response epoch. The other cells exhibit the stereotypical ramping activity and separation of trial types that are thought to reflect preparation of the movements. Both VIND setups (Poisson and Gaussian observations, nonlinear evolution) can reproduce the complex and variable neural dynamics in the held-out trials (9 lick-left trials; 9 lick-right trials). In particular, the Gaussian VIND model can capture changes in the dynamics on even finer timescales (dashed vs. dotted lines; Poisson vs. Gaussian VIND models). This observation agrees with its higher explained variance (Fig. 2).
The latent dynamics are smoother in the Poisson VIND model (Fig. 6B-C). We have found that in VIND fits, smoother trajectories are correlated with superior performance in the forward-interpolation tasks. Intuitively, for noisier latent paths the algorithm ascribes some of the variance to the dynamical system, which hurts the forward-interpolation capabilities. In the Poisson VIND fit represented in Fig. 6, the latent dynamics in dimensions 2 and 3 appear to represent the preparation of the choice: the neural dynamics for different trial types gradually diverge with time. The dynamics in dimension 1 show rapid peaking at the transitions between behavioral epochs. In the Gaussian VIND model, by contrast, these two types of dynamics were mixed and the separation of trial types was less clear. In general, ramping and peaking dynamics are not carried by distinguishable groups of neurons, yet, to our surprise, they are separated in the latent space.
Appendix C Preprocessing of Widefield Imaging Data
Macroscale widefield optical mapping (WFOM) is an increasingly popular technique for surveying neural activity over very large areas of cortex with high temporal resolution. WFOM images the fluorescence of genetically encoded calcium (GCaMP6f) indicators using an LED-illumination and camera-detection scheme. We correct the fluorescence recordings of neural activity for confounding contamination by changes in hemoglobin concentration and oxygenation, as in Ma et al. (2016b), by measuring both the neural fluorescence signals and the hemodynamics. This correction provides an accurate estimate of the change in fluorescence of neural regions. An example frame of the data is shown in Fig. 5A. The activity of the mouse is simultaneously recorded using a webcam pointed at the mouse’s body, and the movement speed at each time point is taken as a 1-D signal: the standard deviation, over pixels, of the difference between consecutive frames.
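The behavioral signal can be computed as sketched below on a toy video; the array shapes are illustrative and the frames are random.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy computation of the 1-D behavioral signal: "speed" at time t is the
# standard deviation, over pixels, of the difference between consecutive
# webcam frames.
frames = rng.random((20, 32, 32))  # 20 frames of 32x32 pixels

diffs = np.diff(frames, axis=0)                        # frame t+1 minus frame t
speed = diffs.reshape(diffs.shape[0], -1).std(axis=1)  # one value per frame pair

assert speed.shape == (19,)
assert np.all(speed >= 0.0)
```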
We use a WFOM recording several minutes in length; the signals are sampled at a fixed rate, leading to a large number of time points. We normalize the signals for every video, and then apply block singular value decomposition (SVD) to the videos for denoising and dimensionality reduction (Buchanan et al. (2018)). First, we fit an anisotropic Wiener filter in a neighborhood of each pixel to reduce uncorrelated noise while preserving spatially local, time-correlated signals. Next, the video is partitioned into blocks, and SVD is performed on the pixels in each block. The temporal components are ranked according to a metric defined on their empirical autocorrelation function, and components that fall within a 99% confidence interval of Gaussian white noise are discarded. Moreover, temporal components whose signal-to-noise ratio falls below a threshold are also discarded. The remaining temporal components from each block are concatenated to form the data matrix. This is augmented with the 1-D behavior signal extracted as the standard deviation of successive frames from a webcam recording the lateral view of the mouse’s body, representing the speed of the mouse’s movements in arbitrary units. We used different sessions of recordings from the same mouse, preprocessed in the same way, to obtain training and validation data.
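A schematic version of the blockwise-SVD step, with a crude singular-value threshold standing in for the autocorrelation-based test of Buchanan et al. (2018); all sizes, the toy video, and the threshold are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Schematic blockwise SVD: a toy video with one shared temporal signal plus
# noise is partitioned into 4x4-pixel blocks, each block is SVD'd, and
# components above a crude singular-value threshold are kept.
T, H, W = 200, 8, 8
t = np.arange(T)
signal = np.sin(2 * np.pi * t / 50.0)
video = signal[:, None, None] * rng.random((1, H, W)) \
    + 0.05 * rng.standard_normal((T, H, W))

def denoise_block(block):
    """SVD a (T, n_pixels) block; keep components above a crude noise floor."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    keep = s > 2.0 * np.median(s)   # stand-in criterion, not the paper's test
    return U[:, keep] * s[keep]     # temporal components, scaled

# Concatenate the surviving temporal components from every block.
temporal_components = []
for i in range(0, H, 4):
    for j in range(0, W, 4):
        block = video[:, i:i + 4, j:j + 4].reshape(T, -1)
        temporal_components.append(denoise_block(block))
X = np.concatenate(temporal_components, axis=1)

assert X.shape[0] == T   # one row per time point
assert X.shape[1] >= 4   # at least the shared signal component per block
```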