When modelling real-valued sequences, a typical approach in current RNN architectures is to use a Gaussian mixture model to describe the conditional output distribution. In this paper, we argue that mixture-based distributions could exhibit structural limitations when faced with highly complex data distributions such as for spatial densities. To address this issue, we introduce recurrent flow networks which combine deterministic and stochastic recurrent hidden states with conditional normalizing flows to form a probabilistic neural generative model capable of describing the kind of variability observed in highly structured spatio-temporal data. Inspired by the model's factorization, we further devise a structured variational inference network to approximate the intractable posterior distribution by exploiting a spatial representation of the data. We empirically evaluate our model against other generative models for sequential data on three real-world datasets for the task of spatio-temporal transportation demand modelling. Results show how the added flexibility allows our model to generate distributions matching potentially complex urban topologies.
Building well-specified probabilistic models for sequential data is a long-standing challenge of the statistical sciences and machine learning. Historically, dynamic Bayesian networks (DBNs), such as hidden Markov models (HMMs) and state space models (SSMs), have characterized a unifying probabilistic framework with illustrious successes in modelling time-dependent dynamics. Advances in deep learning architectures, however, shifted this supremacy towards the field of Recurrent Neural Networks (RNNs). At a high level, both DBNs and RNNs can be framed as parametrisations of two core components: 1) a transition function characterising the time-dependent evolution of a learned internal representation, and 2) an emission function denoting a mapping from representation space to observation space.
Despite their attractive probabilistic interpretation, the biggest limitation preventing the widespread application of DBNs in the deep learning community is that inference can be exact only for models typically characterized by either simple transition/emission functions (e.g. linear Gaussian models) or relatively simple internal representations. On the other hand, RNNs are able to learn long-term dependencies by parametrising a transition function of richly distributed deterministic hidden states. To do so, current RNNs typically rely on gated non-linearities such as long short-term memory (LSTM) cells and gated recurrent units (GRUs), allowing the learned representation to act as internal memory for the model.
More recently, evidence has been gathered in favor of combinations bringing together the representative power of RNNs with the consistent handling of uncertainties given by probabilistic approaches [6, 10, 21, 16, 13, 1, 3, 9]. The core concept underlying recent developments is the idea that, in current RNNs, the only source of variability is found in the conditional emission distribution (i.e. typically a unimodal distribution or a mixture of unimodal distributions), making these models inappropriate when modelling highly structured data. Most efforts have therefore concentrated on building models capable of effectively propagating uncertainty in the transition function of RNNs.
In this paper, we build on these recent advances by shifting the focus towards more flexible emission functions. We suggest that the traditional treatment of output variability through the parametrisation of unimodal (or mixtures of unimodal) distributions may act as a bottleneck in cases characterized by complex data distributions. We propose the use of Conditional Normalizing Flows (CNFs) 
as a general approach to define arbitrarily expressive output probability distributions under temporal dynamics.
In their basic form, normalizing flows act by propagating a simple initial distribution through a series of bijective transformations to produce a richer, more multimodal distribution. In this paper, we are specifically interested in modelling complex sequential data and propose a stochastic version of RNNs capable of exploiting the flexibility of normalizing flows in the conditional output distribution. On the one hand, we model the temporal variability in the data through a transition function combining stochastic and deterministic states; on the other, we propose to use this mixed hidden representation as a conditioning variable to capture the output variability with a CNF. We call this model a Recurrent Flow Network (RFN).
We evaluate the proposed RFNs against both deterministic and stochastic variants of RNNs on three challenging spatio-temporal density estimation tasks. In particular, we focus on the problem of modelling the spatial distribution of transportation demand for the cases of New York, U.S.A. and Copenhagen, Denmark. For the explored tasks, we show how the additional emission flexibility allows RFNs to outperform mixture-based density models in capturing complex spatio-temporal dependencies. To summarize, the main contributions of this paper are threefold:
we propose a probabilistic model which is able to combine deterministic and stochastic temporal representations with the flexibility of normalizing flows in the conditional output distribution;
we use recent advances in variational inference to devise an inference network able to approximate in a scalable manner the intractable posterior distribution over the latent states by exploiting the spatial structure of the data;
we propose a spatial representation of the data enabling temporal models to effectively handle variable-length samples from densities in geo-coordinate space.
Recurrent neural networks are widely used to model variable-length sequences $\mathbf{x}_{1:T} = (\mathbf{x}_1, \dots, \mathbf{x}_T)$, possibly influenced by external covariates $\mathbf{u}_{1:T}$. The core assumption underlying these models is that all observations up to time $t$ can be summarized by a learned deterministic representation $\mathbf{h}_t$. At any timestep $t$, an RNN recursively updates its hidden state $\mathbf{h}_t$ by computing:

$$\mathbf{h}_t = f_\theta(\mathbf{h}_{t-1}, \mathbf{u}_t),$$

where $f_\theta$ is a deterministic non-linear transition function parametrised by $\theta$, such as an LSTM cell or a GRU. The sequence is then modelled by defining a factorization of the joint probability distribution as the following product of conditional probabilities:

$$p(\mathbf{x}_{1:T} \mid \mathbf{u}_{1:T}) = \prod_{t=1}^{T} p(\mathbf{x}_t \mid \mathbf{h}_t), \qquad p(\mathbf{x}_t \mid \mathbf{h}_t) = g_\tau(\mathbf{h}_t),$$

where $g_\tau$ is typically a non-linear emission function with parameters $\tau$.
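As an illustration, the deterministic recurrence above can be sketched in a few lines of NumPy. The vanilla tanh cell, dimensions and random weights here are illustrative stand-ins, not the LSTM/GRU cells used in the paper:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    # Deterministic transition: h_t = tanh(W_h h_{t-1} + W_x x_t + b)
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

rng = np.random.default_rng(0)
d_h, d_x = 4, 2  # illustrative hidden/input sizes
W_h = rng.normal(size=(d_h, d_h))
W_x = rng.normal(size=(d_h, d_x))
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):   # a length-5 toy sequence
    h = rnn_step(h, x_t, W_h, W_x, b)   # h now summarizes x_{1:t}
```

An emission function would then map the final `h` to the parameters of a conditional output distribution at each step.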
When modelling complex real-valued sequences, a common choice is to represent the emission function with a mixture density network (MDN), as in . The idea behind MDNs is to use the output of a neural network to parametrise a Gaussian mixture model. In the context of RNNs, a subset of the outputs at time $t$ is used to define the vector of mixture proportions $\pi_t$, while the remaining outputs are used to define the means $\mu_{t,k}$ and covariances $\Sigma_{t,k}$ for the corresponding mixture components. Under this framework, the probability of $\mathbf{x}_t$ is defined as follows:

$$p(\mathbf{x}_t \mid \mathbf{h}_t) = \sum_{k=1}^{K} \pi_{t,k}\, \mathcal{N}\left(\mathbf{x}_t; \mu_{t,k}, \Sigma_{t,k}\right),$$

where $K$ is the assumed number of components characterising the mixture.
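A minimal NumPy sketch of such a mixture-based emission, here evaluating the log-density of a diagonal-covariance Gaussian mixture (the two-component 2-D setup is purely illustrative):

```python
import numpy as np

def mdn_log_prob(y, log_pi, mu, sigma):
    # Diagonal-Gaussian mixture log-density:
    # log sum_k pi_k N(y; mu_k, diag(sigma_k^2)), computed stably in log-space
    log_pi = log_pi - np.logaddexp.reduce(log_pi)  # normalize mixture weights
    comp = -0.5 * np.sum(((y - mu) / sigma) ** 2
                         + 2.0 * np.log(sigma)
                         + np.log(2.0 * np.pi), axis=-1)
    return np.logaddexp.reduce(log_pi + comp)

# Two equally weighted components in 2-D
log_pi = np.log(np.array([0.5, 0.5]))
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma = np.ones((2, 2))
lp = mdn_log_prob(np.array([0.0, 0.0]), log_pi, mu, sigma)
```

In an RNN-MDN, `log_pi`, `mu` and `sigma` would be produced by the network from the hidden state at each timestep.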
As introduced in , a stochastic recurrent neural network (SRNN) represents a specific architecture combining deterministic RNNs with fully stochastic SSM layers. At a high level, SRNNs build a hierarchical internal representation by stacking an SSM transition over stochastic states $\mathbf{z}_t$ on top of an RNN over deterministic states $\mathbf{h}_t$. The emission function is further defined by skip-connections mapping both deterministic ($\mathbf{h}_t$) and stochastic ($\mathbf{z}_t$) states to observation space ($\mathbf{x}_t$). Assuming that the starting hidden states and inputs are given, the model is defined by the following factorization:

$$p(\mathbf{x}_{1:T}, \mathbf{z}_{1:T} \mid \mathbf{u}_{1:T}) = \prod_{t=1}^{T} p(\mathbf{x}_t \mid \mathbf{z}_t, \mathbf{h}_t)\, p(\mathbf{z}_t \mid \mathbf{z}_{t-1}, \mathbf{h}_t),$$

where $p(\mathbf{x}_t \mid \mathbf{z}_t, \mathbf{h}_t)$ and $p(\mathbf{z}_t \mid \mathbf{z}_{t-1}, \mathbf{h}_t)$ represent again the emission and transition functions and where parameters are jointly optimized at inference time.
In order for the density of $\mathbf{x}$ to be well-defined, some important properties need to be satisfied. In particular, the transformation $g$ must be invertible and both $g$ and $g^{-1}$ must be differentiable. Such a transformation is known as a diffeomorphism (i.e. a differentiable bijection having a differentiable inverse). If these properties are satisfied, the model distribution on $\mathbf{x}$ can be obtained by the change of variable formula:

$$p_{\mathbf{x}}(\mathbf{x}) = p_{\mathbf{z}}(\mathbf{z}) \left| \det J_{g^{-1}}(\mathbf{x}) \right|,$$

where $\mathbf{z} = g^{-1}(\mathbf{x})$ and the Jacobian $J_{g^{-1}}(\mathbf{x})$ is the matrix of all partial derivatives of $g^{-1}$. In practice, the transformation $g$ and the base distribution $p_{\mathbf{z}}$ can have parameters of their own (e.g. $p_{\mathbf{z}}$ could be a multivariate normal with mean and covariance also parametrised by any flexible function). The fundamental property which makes normalizing flows so attractive is that invertible and differentiable transformations are composable. That is, given two transformations $g_1$ and $g_2$, their composition $g_2 \circ g_1$ is also invertible and differentiable, with inverse and Jacobian determinant given by:

$$(g_2 \circ g_1)^{-1} = g_1^{-1} \circ g_2^{-1}, \qquad \det J_{g_2 \circ g_1}(\mathbf{z}) = \det J_{g_2}\left(g_1(\mathbf{z})\right) \cdot \det J_{g_1}(\mathbf{z}).$$

As a result, this framework allows one to construct arbitrarily complex transformations by composing multiple stages of simpler transformations, without sacrificing the ability to exactly calculate the (log) density $p_{\mathbf{x}}(\mathbf{x})$.
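The change-of-variable formula and the additivity of log-determinants under composition can be checked on a toy 1-D example. The affine maps below (in the normalizing direction, data to base) are illustrative; composing them is equivalent to a single affine map whose log-determinant is the sum of the stages':

```python
import numpy as np

def affine(x, a, b):
    # 1-D affine map z = a*x + b, with log|det J| = log|a|
    return a * x + b, np.log(np.abs(a))

def log_standard_normal(z):
    # log N(z; 0, 1)
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

# Compose two affine maps; the log-determinants of the stages simply add
x = 1.5
z1, ld1 = affine(x, 2.0, 1.0)
z2, ld2 = affine(z1, 0.5, -1.0)
log_px = log_standard_normal(z2) + ld1 + ld2  # change-of-variable formula
```

Here the composition collapses to the single map `z = x - 0.5` with zero log-determinant, so the two-stage density agrees with the direct computation.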
In , the authors introduce a bijective function of particular interest for this paper. This transformation, known as an affine coupling layer, exploits the simple observation that the determinant of a triangular matrix can be efficiently computed as the product of its diagonal terms. Concretely, given a $D$-dimensional input vector $\mathbf{x}$ and $d < D$, this property is exploited by defining the output $\mathbf{y}$ of an affine coupling layer as follows:

$$\mathbf{y}_{1:d} = \mathbf{x}_{1:d},$$
$$\mathbf{y}_{d+1:D} = \mathbf{x}_{d+1:D} \odot \exp\left(s(\mathbf{x}_{1:d})\right) + t(\mathbf{x}_{1:d}),$$

where $s$ and $t$ are arbitrarily complex scale and translation functions from $\mathbb{R}^d \mapsto \mathbb{R}^{D-d}$ and $\odot$ is the element-wise or Hadamard product. Since the forward computation defined in Eq. (9) and Eq. (10) leaves the first $d$ components unchanged, these transformations are usually combined by composing coupling layers in an alternating pattern, so that components unchanged in one layer are effectively updated in the next (for a more in-depth treatment of normalizing flows, the reader is referred to [24, 20]).
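A self-contained sketch of a single affine coupling layer, assuming toy scale/translation functions (the actual model parametrises $s$ and $t$ with deep networks):

```python
import numpy as np

def coupling_forward(x, d, scale_fn, trans_fn):
    # Affine coupling: the first d dims pass through unchanged; the rest are
    # scaled/translated conditioned on x[:d]. log|det J| is the sum of log-scales.
    x1, x2 = x[:d], x[d:]
    s, t = scale_fn(x1), trans_fn(x1)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2]), np.sum(s)

def coupling_inverse(y, d, scale_fn, trans_fn):
    # Exact inverse: recompute s, t from the unchanged half and undo the affine map
    y1, y2 = y[:d], y[d:]
    s, t = scale_fn(y1), trans_fn(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

# Toy scale/translation functions (any functions of x[:d] would do)
scale_fn = lambda x1: np.tanh(x1)   # bounded log-scales
trans_fn = lambda x1: 2.0 * x1

x = np.array([0.3, -0.7, 1.2, 0.5])
y, logdet = coupling_forward(x, 2, scale_fn, trans_fn)
x_rec = coupling_inverse(y, 2, scale_fn, trans_fn)
```

Inversion never requires inverting `scale_fn` or `trans_fn` themselves, which is what makes arbitrarily complex networks admissible here.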
In this section, we define the generative model and inference network characterising the RFN for the purpose of sequence modelling. RFNs explicitly model temporal dependencies by combining deterministic and stochastic layers. The resulting intractability of the posterior distribution over the latent states $\mathbf{z}_{1:T}$, as in the case of VAEs [19, 27], is addressed by learning a tractable approximation through amortized variational inference. A schematic view of the RFN is shown in Fig. 1.
Generative model: As in , the transition function of the RFN interlocks an SSM with an RNN:

$$\mathbf{h}_t = f_\theta(\mathbf{h}_{t-1}, \mathbf{u}_t), \tag{11}$$
$$p(\mathbf{z}_t \mid \mathbf{z}_{t-1}, \mathbf{h}_t) = \mathcal{N}\left(\mathbf{z}_t; \mu_t^{(p)}, \sigma_t^{(p)}\right), \quad \left[\mu_t^{(p)}, \sigma_t^{(p)}\right] = \varphi^{\mathrm{prior}}(\mathbf{z}_{t-1}, \mathbf{h}_t), \tag{12}$$

where $\mu_t^{(p)}$ and $\sigma_t^{(p)}$ represent the parameters of the conditional prior distribution over the stochastic hidden states $\mathbf{z}_t$. In our implementation, $f_\theta$ and $\varphi^{\mathrm{prior}}$ are respectively an LSTM cell and a deep feed-forward neural network, with parameters $\theta$ and $\varphi$. In Eq. (11), a neural network can also be used to extract features from $\mathbf{u}_t$. Unlike the SRNN, the learned representations (i.e. $\mathbf{h}_t$, $\mathbf{z}_t$) are used as conditioners for a CNF parametrising the output distribution. That is, for every time-step $t$, we learn a complex distribution by defining the conditional base distribution and conditional coupling layers characterising the transformation as follows:

$$p(\mathbf{e}_t \mid \mathbf{z}_t, \mathbf{h}_t) = \mathcal{N}\left(\mathbf{e}_t; \mu_t^{(b)}, \sigma_t^{(b)}\right), \quad \left[\mu_t^{(b)}, \sigma_t^{(b)}\right] = \varphi^{\mathrm{base}}(\mathbf{z}_t, \mathbf{h}_t), \tag{13}$$
$$\mathbf{y}_{d+1:D} = \mathbf{x}_{d+1:D} \odot \exp\left(s(\mathbf{x}_{1:d}, \mathbf{z}_t, \mathbf{h}_t)\right) + t(\mathbf{x}_{1:d}, \mathbf{z}_t, \mathbf{h}_t), \tag{14}$$

where $\mu_t^{(b)}$ and $\sigma_t^{(b)}$ represent the parameters of the conditional base distribution (determined by a learnable function $\varphi^{\mathrm{base}}$), while $s$ and $t$ denote the conditional scale and translation functions characterising the coupling layers in the CNF. In our implementation, $\varphi^{\mathrm{base}}$, $s$ and $t$ are parametrised by deep neural networks. Together, Eq. (13) and Eq. (14) define the emission function $p(\mathbf{x}_t \mid \mathbf{z}_t, \mathbf{h}_t)$, enabling the generative model to result in the factorization in Eq. (4).
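To make the generative factorization concrete, here is a minimal NumPy sketch of a few generative steps. The scalar states and the stand-ins for the LSTM cell, prior network and conditional flow are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def lstm_like(h_prev, u_t):
    # Stand-in for the deterministic LSTM transition (Eq.-(11)-style update)
    return np.tanh(h_prev + u_t.mean())

def prior_params(z_prev, h_t):
    # Stand-in for the feed-forward prior network of p(z_t | z_{t-1}, h_t)
    return np.tanh(z_prev + h_t), 0.5          # (mean, std)

def conditional_flow_sample(z_t, h_t):
    # Sample a conditional base distribution, then apply one conditional
    # affine transform whose scale/shift depend on the conditioners (z_t, h_t)
    eps = rng.normal(size=2)                   # base noise
    base = eps + np.array([h_t, z_t])          # conditional base sample
    return base * np.exp(0.1 * np.tanh(h_t)) + 0.2 * z_t

h, z = 0.0, 0.0
for u_t in rng.normal(size=(3, 2)):            # three generative steps
    h = lstm_like(h, u_t)                      # deterministic state
    m, s = prior_params(z, h)
    z = m + s * rng.normal()                   # stochastic state from the prior
    x_t = conditional_flow_sample(z, h)        # emission through the CNF
```

The key structural point is the last line: both the deterministic and stochastic states condition the flow, rather than the flow itself carrying the temporal state.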
Inference: The variational approximation defining the RFN directly depends on $\mathbf{x}_t$, $\mathbf{z}_{t-1}$ and $\mathbf{h}_t$ as follows:

$$q(\mathbf{z}_t \mid \mathbf{x}_t, \mathbf{z}_{t-1}, \mathbf{h}_t) = \mathcal{N}\left(\mathbf{z}_t; \mu_t^{(q)}, \sigma_t^{(q)}\right), \quad \left[\mu_t^{(q)}, \sigma_t^{(q)}\right] = \varphi^{\mathrm{enc}}(\mathbf{x}_t, \mathbf{z}_{t-1}, \mathbf{h}_t), \tag{15}$$

where $\varphi^{\mathrm{enc}}$ is an encoder network defining the parameters $\mu_t^{(q)}$ and $\sigma_t^{(q)}$ of the approximate posterior distribution. Given the above structure, the generative and inference models are tied through the RNN hidden state $\mathbf{h}_t$, resulting in the factorization given by:

$$q(\mathbf{z}_{1:T} \mid \mathbf{x}_{1:T}) = \prod_{t=1}^{T} q(\mathbf{z}_t \mid \mathbf{x}_t, \mathbf{z}_{t-1}, \mathbf{h}_t). \tag{16}$$

In addition to the explicit dependence of the approximate posterior on $\mathbf{x}_t$ and $\mathbf{z}_{t-1}$, the inference network defined in Eq. (15) also exhibits an implicit dependence on all past observations and inputs through $\mathbf{h}_t$. This implicit dependency on all information from the past can be considered as resembling a filtering approach from the state-space model literature. Denoting $\theta$ and $\phi$ as the set of model and variational parameters respectively, variational inference offers a scheme for jointly optimising parameters and computing an approximation to the posterior distribution by maximising the following step-wise evidence lower bound (i.e. ELBO):

$$\mathcal{L}(\theta, \phi) = \sum_{t=1}^{T} \mathbb{E}_{q_\phi}\left[ \log p_\theta(\mathbf{x}_t \mid \mathbf{z}_t, \mathbf{h}_t) \right] - \mathrm{KL}\left( q_\phi(\mathbf{z}_t \mid \mathbf{x}_t, \mathbf{z}_{t-1}, \mathbf{h}_t) \,\middle\|\, p_\theta(\mathbf{z}_t \mid \mathbf{z}_{t-1}, \mathbf{h}_t) \right). \tag{17}$$

The generative and inference models are therefore learned jointly, so that the variational approximation is effectively tracking a moving posterior.
In this paper, we are interested in modelling the spatio-temporal demand distribution for different transportation services. The complex spatial structure (in latitude-longitude space), together with the inherent temporal dynamics characterising the demand distribution, make this problem particularly relevant from both a methodological and an applied standpoint. Being able to model and accurately forecast the need for transportation could allow service providers and institutions to guarantee more efficient systems, ultimately leading to reduced traffic congestion and lower emissions. We evaluate the proposed RFN (code available at https://github.com/DanieleGammelli/recurrent-flow-nets) on three transportation datasets:
NYC Taxi (NYC-P/D): This dataset is released by the New York City Taxi and Limousine Commission. We aggregated taxi demand in 2-hour bins for the month of March 2016, containing 249,637 trip geo-coordinates. We further differentiated between the task of modelling pick-ups (i.e. where the demand is) and drop-offs (i.e. where people want to go). In what follows, we denote the two datasets as NYC-P and NYC-D respectively.
Copenhagen Bike-Share (CPH-BS): This dataset contains geo-coordinates from users accessing the smartphone app of Donkey Republic, one of the major bike sharing services in Copenhagen, Denmark. As for the case of New York, we aggregated the geo-coordinates in 2-hour bins for the month of August, resulting in 87,740 app accesses.
For both the New York and Copenhagen experiments we process the data (in our implementation, we used a variation of the pre-processing from https://github.com/hughsalimbeni/bayesian_benchmarks/blob/master/bayesian_benchmarks/data.py) so as to discard corrupted geo-coordinates outside the area of interest. For the taxi experiments, we discarded coordinates related to trips either shorter than or longer than , while in the bike-sharing dataset, we ensured to keep only one app access from the same user in a window of minutes. In both cases we divide the data temporally into train/validation/test splits using a ratio of .
Training: We train each model using stochastic gradient ascent on the evidence lower bound defined in Eq. (17) using the Adam optimizer, with a starting learning rate of being reduced by a factor of every epochs without loss improvement (in our implementation, we used the ReduceLROnPlateau scheduler in PyTorch with patience=100). As in , we found that annealing the KL term in Eq. (17) (using a scalar multiplier linearly increasing from 0 to 1 over the course of training) yielded better results. The final model was selected with an early-stopping procedure based on the validation performance. Training using an NVIDIA GeForce RTX 2080 Ti took around 6 hours for CPH-BS and around 9 hours for NYC-P/D.
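The linear KL annealing multiplier described above can be sketched as follows (the warm-up length is an illustrative assumption, not the schedule used in our experiments):

```python
def kl_weight(step, warmup_steps):
    # Linearly anneal the KL multiplier from 0 to 1 over the warm-up period,
    # so early training emphasizes reconstruction before regularizing q(z)
    return min(1.0, step / warmup_steps)

# At training step t, the objective becomes:
#   reconstruction_term - kl_weight(t, warmup_steps) * kl_term
```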
Models: We compare the RFN with RNN, VRNN and SRNN models using two different MDN-based emission distributions. In particular, we compare against a GMM output parametrised by Gaussians with either diagonal (MDN-Diag) or full (MDN-Full) covariance matrices. Based on a random search, we use 50 and 30 mixture components for MDN-Diag and MDN-Full respectively.
For every model, we select a single layer of 128 LSTM cells. The feature extractor in Eq. (11) has three layers of 128 hidden units using rectified linear activations. For the VRNN, SRNN and RFN we also define a 128-dimensional latent state $\mathbf{z}_t$. Both the transition function from Eq. (12) and the inference network in Eq. (15) use a single layer of 128 hidden units. For the mixture-based models, the MDN emission is further defined by two layers of 64 hidden units, where we use a softplus activation to ensure the positivity of the variance vector in the MDN-Diag case and a Cholesky decomposition of the full covariance matrix in MDN-Full. The emission function in the RFN is defined as in Eq. (13) and Eq. (14), where the conditional base, scale and translation functions are neural networks with two layers of 128 hidden units. The conditional flow is further defined as an alternation of 35 layers of the triplet [Affine coupling layer, Batch Normalization, Permutation], where the permutation ensures that all dimensions are processed by the affine coupling layers and where the batch normalization ensures better propagation of the training signal, as shown in . In our experiments we define , although the input could potentially be used to introduce relevant information for the problem at hand (e.g. weather or special event data in the case of spatio-temporal transportation demand estimation).
All models were implemented using PyTorch  and the universal probabilistic programming language Pyro . To reduce computational cost, we use a single sample to approximate the intractable expectations in the ELBO.
Spatial representation: For the task of spatio-temporal density estimation, the data $\mathbf{x}_t$ takes the form of a set of variable-length samples from the target distribution. That is, for every time-step $t$, $\mathbf{x}_t$ is a vector of geo-coordinates representing a corresponding number of taxi trips (NYC-P/D) or smartphone app accesses (CPH-BS). We propose to process the data into a representation enabling the models to effectively handle data in a single batch computation. As shown in Fig. 2, we choose to represent $\mathbf{x}_t$ as a normalized 2-dimensional histogram (in our implementation we set ). Given its ability to preserve the spatial structure of the data, we believe this representation to be well suited for spatio-temporal density estimation tasks. More precisely, the proposed representation is obtained by applying the following three-step procedure: 1) select the data $\mathbf{x}_t$, 2) build a 2-dimensional histogram computing the counts of the geo-coordinates falling in every cell of the grid, and 3) normalize the histogram so that its cells sum to one. By fixing the grid, this enables the definition of a sequence generation problem over spatial densities. In practice, we found the above spatial representation to be both practical in dealing with variable-length geo-coordinate vectors, as well as effective, yielding better results. To the authors' best knowledge, this spatial approximation of the target distribution has never been used for the task of spatio-temporal density modelling.
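The three-step procedure can be sketched with NumPy's `histogram2d` (the 32x32 grid, unit-square bounds and uniform sample coordinates are illustrative assumptions):

```python
import numpy as np

def spatial_histogram(coords, bounds, bins=32):
    # coords: (N, 2) array of (lat, lon) samples;
    # bounds: ((lat_min, lat_max), (lon_min, lon_max)) fixing the grid
    H, _, _ = np.histogram2d(coords[:, 0], coords[:, 1],
                             bins=bins, range=bounds)
    total = H.sum()
    # Normalize so the cells sum to one (an empirical spatial density)
    return H / total if total > 0 else H

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 1.0, size=(500, 2))       # toy geo-coordinates
x_t = spatial_histogram(coords, ((0.0, 1.0), (0.0, 1.0)), bins=32)
```

Because the grid is fixed across timesteps, sequences of variable-length coordinate sets become sequences of fixed-size arrays that can be batched directly.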
Results: In Table 1 we compare test log-likelihoods on the tasks of spatio-temporal demand modelling for the cases of New York and Copenhagen. We report exact log-likelihoods for both RNN-MDN-Diag and RNN-MDN-Full, while in the case of VRNNs, SRNNs and RFNs, given their inherent stochasticity, we report the importance sampling approximation to the marginal log-likelihood as stated in  using 30 samples. We see from Table 1 that the RFN outperforms competing methods, yielding higher log-likelihoods. The results support our claim that more flexible output distributions are advantageous when modelling potentially complex and structured temporal data distributions.
In Fig. 3, we show a visualization of the predicted spatial densities from three of the implemented models at specific times of the day. The heatmap was generated by computing the approximation of the marginal log-likelihood, under the respective model, on a grid within the considered geographical boundaries. The final plot is further obtained by mapping the computed log-likelihoods back into latitude-longitude space. As opposed to GMM-based densities, the figures show how the RFN exploits the flexibility of conditional normalizing flows to generate sharper distributions capable of better approximating complex shapes such as geographical landforms or urban topologies.
A number of recent works investigate more expressive output distributions for sequence models. As we do in this paper, these works argue that simpler output models may turn out limiting when dealing with structured and potentially high-dimensional data distributions (e.g. images, videos). The performance of these models is highly dependent on the specific architecture defined in the conditional output distribution, as well as on how stochasticity is propagated in the transition function. In this section we highlight how RFNs differ from some of these works.
In VideoFlow and in , the authors similarly use normalizing flows to parametrise the emission function for the tasks of video generation and multi-variate time series forecasting, respectively. In VideoFlow, the latent states representing the temporal evolution of the system are defined by the conditional base distribution of a normalizing flow. This differs from our work, where we explicitly model the temporal dynamics through a combination of latent variables and fully deterministic recurrent hidden states. We found that this mixed hidden representation yields better performance in practice. Normalizing flows in RFNs are therefore exclusively used to model output variability through the emission function, rather than directly parametrising the recurrent hidden states. Moreover, the architecture of VideoFlow is inspired by Glow, and so specifically tailored for image generation tasks (e.g. through the proposed 3D multi-scale latent variables and the 3D dilated Convolutional Residual Network). Similarly to our work, in  the authors also propose to use conditional affine coupling layers in order to model the output variability. RFNs clearly differ through the combination of stochastic and deterministic recurrent hidden states in the transition function.
In , the authors address the task of video generation by defining a hierarchical version of the VRNN and a Conv-LSTM decoder. While the latter also combines stochastic and deterministic states, as for the case of VideoFlow, the emission function is specifically focused on image modelling tasks.
This work addresses the problem of spatio-temporal density modelling by proposing the use of conditional normalizing flows as a general approach to parametrise the output distribution of recurrent latent variable models. We approximate the intractable posterior distribution over the latent states by devising an inference network able to exploit the spatio-temporal structure of the data distribution. We also propose to use a spatial representation of data to effectively represent samples from densities in geo-coordinate space within temporal models. Our experiments focus on real-world data for the task of transportation demand density modelling. We empirically show that the flexibility of normalizing flows enables RFNs to generate rich output distributions capable of describing potentially complex geographical surfaces.
In future work, similarly to , we plan to apply RFNs for the task of video generation. We believe the combination of deterministic and stochastic hidden representations could enable more reliable long-term predictions compared to fully stochastic states. We also plan to explore the role of multi-head attention mechanisms  for the efficient and effective learning of both long-term and short-term dependencies between the currently Markovian stochastic states.