1 Introduction
Models of sequential data play an integral role in a range of applications related to representation learning, sequence synthesis and prediction. As real-world data is often subject to noise and detection or annotation errors, probabilistic sequence models are favorable: they take uncertainty in the data into account and provide an implicit or explicit representation of the underlying probability distribution.
Determining such a probabilistic sequence model is commonly laid out as a learning problem, in which a model of an unknown underlying stochastic process is learned from given sample sequences. Common approaches are based either on Gaussian processes [33] (e.g. [10, 28]) or, more prevalently, on neural networks, i.e. approximate Bayesian neural models (e.g. Bayesian Neural Networks [5, 6, 13], Variational Autoencoders [22, 36, 7] and Generative Adversarial Networks [14, 29, 38]) or regression-based neural models built on Mixture Density Networks (MDN) [4] (e.g. [16]). Approximate Bayesian neural models contain stochastic components and allow sampling from the modeled probability distribution. These models typically require Monte Carlo simulation or variational approximation during training and inference. MDNs, on the other hand, are deterministic models which parameterize a mixture distribution, making them potentially less expressive as probabilistic models, as the model merely learns to generate point estimates for the target distribution. Although MDNs are more stable and less computationally heavy during training, Monte Carlo simulation is still required for multimodal inference. A drawback common to both approaches is that sequences are usually processed and generated in an iterative fashion; as a consequence, outliers occurring during Monte Carlo simulation may have a negative impact on all subsequent time steps.
In order to tackle difficulties with multimodal inference, Hug et al. [18, 19] proposed a variation of MDNs which operates in the domain of parametric curves instead of the data domain, allowing multiple time steps to be inferred in a single inference step. The model is built on a probabilistic extension of Bézier curves (Curves), which themselves pose a special case of Gaussian processes (GP, [33]). Following this, this paper aims to exploit this relationship in order to reintroduce some of the expressiveness without losing the advantages of the regression-based model. This is done by deriving the Gaussian process induced by a Curve (mixture) as generated by the MDN. In terms of the GP, the MDN then generates its prior distribution, from which different posterior predictive distributions can be calculated in the presence of data. This, in turn, can for example be used for refining a prediction generated by the MDN in a sequence prediction task. This use case is considered for evaluating the viability of the combined model. Following this, the main contributions of this paper are:

The derivation of a new mean and covariance function for a GP derived from Curves, covering the univariate and multivariate as well as the multimodal case.

A combination of regression-based probabilistic sequence prediction and GP-based prediction refinement.
2 Preliminaries
2.1 Gaussian Processes
A Gaussian process (GP, [33]) is a stochastic process with index set $T$, where the joint distribution of any finite collection of its stochastic variables is a multivariate Gaussian distribution. For simplicity, the index set will be interpreted as time throughout this paper. The joint distribution is obtained using an explicit mean function $m(t)$ and a positive definite covariance function $k(t, t')$, commonly referred to as the kernel of the Gaussian process, and yields a multivariate Gaussian prior probability distribution over function space. Commonly, a zero mean function $m(t) = 0$ is assumed. Given a collection of sample points of a function, the posterior (predictive) distribution modeling non-observed function values can be obtained through conditioning. As such, Gaussian processes provide a well-established model for probabilistic sequence modeling.
2.2 Probabilistic Bézier Curves
Probabilistic Bézier Curves (Curves, [18, 19]) are Bézier curves [32] defined by $N + 1$ independent Gaussian control points $P_i \sim \mathcal{N}(\mu_i, \Sigma_i)$ with $i \in \{0, \ldots, N\}$. Through the curve construction function

$X_t = B(t) = \sum_{i=0}^{N} b_{i,N}(t)\, P_i \sim \mathcal{N}(\mu(t), \Sigma(t))$  (1)

with

$\mu(t) = \sum_{i=0}^{N} b_{i,N}(t)\, \mu_i$  (2)

and

$\Sigma(t) = \sum_{i=0}^{N} b_{i,N}(t)^2\, \Sigma_i,$  (3)

where $b_{i,N}(t) = \binom{N}{i} t^i (1 - t)^{N - i}$ are the Bernstein polynomials [26], the stochasticity is passed from the control points to the curve points $X_t$, yielding a sequence of Gaussian distributions along the underlying Bézier curve. Thus, a stochastic process with index set $t \in [0, 1]$ can be defined. For representing discrete data, i.e. sequences of length $M$, a discrete subset of $[0, 1]$ can be employed for connecting sequence indices with evenly distributed values in $[0, 1]$, yielding

$T = \{ t_j \}_{j \in \{0, \ldots, M-1\}} \quad \text{with} \quad t_j = \frac{j}{M - 1}.$  (4)
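To make the construction concrete, the curve-point statistics above can be sketched in NumPy for the univariate case; the control-point means and variances below are arbitrary illustration values, not parameters from the paper:

```python
import numpy as np
from math import comb

def bernstein(i, n, t):
    # Bernstein polynomial b_{i,n}(t) = C(n, i) * t^i * (1 - t)^(n - i)
    return comb(n, i) * t**i * (1.0 - t)**(n - i)

def curve_point(t, means, variances):
    # Mean (Eq. 2) and variance (Eq. 3) of the Gaussian curve point X_t
    # induced by independent univariate Gaussian control points.
    n = len(means) - 1
    w = np.array([bernstein(i, n, t) for i in range(n + 1)])
    return float(w @ np.asarray(means)), float((w**2) @ np.asarray(variances))

# Discrete index set for a sequence of length 5 (Eq. 4): t_j = j / (M - 1)
ts = np.linspace(0.0, 1.0, 5)

# Example with three control points (illustrative values)
mu, var = curve_point(ts[2], means=[0.0, 1.0, 0.0], variances=[1.0, 1.0, 1.0])
```

At $t = 0.5$ the Bernstein weights are $(0.25, 0.5, 0.25)$, so the sketch returns mean $0.5$ and variance $0.375$ for these example parameters.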
3 Curve Gaussian Processes
With Curves providing a representation for stochastic processes comprised of Gaussian random variables $X_t$, it can be shown that Curves are a special case of GPs using an implicit covariance function. Following the definition of GPs [27, 33], a Curve can be classified as a GP if, for any finite subset $\{t_1, \ldots, t_M\}$ of $[0, 1]$, the joint probability density of the corresponding random variables $X_{t_1}, \ldots, X_{t_M}$ is Gaussian. This property is referred to as the GP property and can be shown to hold by reformulating the curve construction formula into a linear transformation¹ of the Gaussian control points stacked into a vector

$\mathbf{P} = (P_0, P_1, \ldots, P_N)^\top$  (5)

using a transformation matrix

$A = \begin{pmatrix} b_{0,N}(t_1) & \cdots & b_{N,N}(t_1) \\ \vdots & \ddots & \vdots \\ b_{0,N}(t_M) & \cdots & b_{N,N}(t_M) \end{pmatrix}$  (6)

determined by the Bernstein polynomials, with $\mathbf{X} = (X_{t_1}, \ldots, X_{t_M})^\top = A\mathbf{P}$ [19]. As $\mathbf{P}$ is itself a Gaussian random vector, $\mathbf{X}$ is again Gaussian with its corresponding probability density function being a Gaussian probability density. As the Gaussians along a Curve are correlated through the use of common control points and the way the curve is constructed, the kernel function can also be given explicitly. The following sections provide mean and kernel functions for the univariate, multivariate and multimodal case.

¹ For clarity, multivariate random variables may be written in bold font occasionally.
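The linear-transformation argument can be illustrated with a small NumPy sketch; the control-point parameters are arbitrary placeholders chosen for illustration:

```python
import numpy as np
from math import comb

def bernstein_matrix(ts, n):
    # Transformation matrix A with A[j, i] = b_{i,n}(t_j) (Eq. 6), mapping the
    # stacked control-point vector P (Eq. 5) onto the curve-point vector X = A P.
    return np.array([[comb(n, i) * t**i * (1.0 - t)**(n - i)
                      for i in range(n + 1)] for t in ts])

ts = np.linspace(0.0, 1.0, 4)
A = bernstein_matrix(ts, n=2)

# As X = A P is a linear map of a Gaussian vector, X is again Gaussian with
# mean A mu_P and covariance A Sigma_P A^T -- the GP property.
mu_P = np.array([0.0, 1.0, 0.0])
Sigma_P = np.diag([1.0, 0.5, 1.0])  # independent control points
mu_X = A @ mu_P
Sigma_X = A @ Sigma_P @ A.T
```

Note that the rows of $A$ sum to one (partition of unity of the Bernstein polynomials), and $A \Sigma_P A^\top$ is by construction a symmetric covariance matrix.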
3.1 Univariate Gaussian Processes
Univariate GPs targeting scalar-valued functions are considered first, being the most common use case. Further, this grants a simple case for deriving the mean and kernel functions induced by a given Curve, while also allowing a visual examination of some properties of the induced GP, denoted as the Curve GP in the following. Here, the stochastic control points are defined by their mean values $\mu_i$ and variances $\sigma_i^2$. The mean function is equivalent to Eq. 2. Thus, this section focuses on the kernel function for two curve points $X_{t_1}$ and $X_{t_2}$ at indices $t_1$ and $t_2$ with $t_1 \le t_2$. The respective mean values are given by $\mu(t_1)$ and $\mu(t_2)$. With $\mathrm{Cov}(P_i, P_j) = 0$ for $i \ne j$, which follows from the independence of the control points, and $\mathrm{Var}(P_i) = \sigma_i^2$, the closed-form solution

$k(t_1, t_2) = \mathrm{Cov}(X_{t_1}, X_{t_2}) = \sum_{i=0}^{N} b_{i,N}(t_1)\, b_{i,N}(t_2)\, \sigma_i^2$  (7)

follows.
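A minimal sketch of the kernel in Eq. 7, here for a Curve with two unit-variance control points (an arbitrary illustrative choice):

```python
import numpy as np
from math import comb

def curve_kernel(t1, t2, variances):
    # k(t1, t2) = sum_i b_{i,N}(t1) * b_{i,N}(t2) * sigma_i^2  (Eq. 7)
    n = len(variances) - 1
    b = lambda i, t: comb(n, i) * t**i * (1.0 - t)**(n - i)
    return sum(b(i, t1) * b(i, t2) * v for i, v in enumerate(variances))

# Gram matrix over equally spaced indices in [0, 1]
ts = np.linspace(0.0, 1.0, 5)
K = np.array([[curve_kernel(s, t, [1.0, 1.0]) for t in ts] for s in ts])
```

For two control points the kernel reduces to $k(t_1, t_2) = (1 - t_1)(1 - t_2) + t_1 t_2$, so the resulting Gram matrix is symmetric with ones at the corners $(0, 0)$ and $(1, 1)$.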
Considering the kernel function and the resulting Gram matrix, the Curve GP is heavily dependent on the given set of control points, allowing for a range of different kernels. Fig. 1 illustrates a radial basis function (RBF, [15]) kernel

$k_{\mathrm{RBF}}(t_1, t_2) = \sigma^2 \exp\left(-\frac{(t_1 - t_2)^2}{2\ell^2}\right)$  (8)

with variance $\sigma^2$ and length scale $\ell$, a linear kernel [15]

$k_{\mathrm{lin}}(t_1, t_2) = \sigma_b^2 + \sigma_v^2 (t_1 - c)(t_2 - c)$  (9)

with offset $c$, and two Curve GP kernels. The first Curve consists of two unit Gaussian control points, and the second consists of several zero-mean Gaussian control points with varying standard deviations, chosen in order to cope with non-linear blending (see Eq. 3). Fig. 1 depicts the Gram matrix calculated from equally spaced values in $[0, 1]$ for each kernel. When comparing the Gram matrices, it can be seen that the Gram matrix calculated with the simple two-control-point Curve kernel is equal to that calculated with the linear kernel when normalizing its values to $[0, 1]$. On the other hand, the Gram matrix obtained from the more complex Curve tends to be more comparable to the Gram matrix calculated with the RBF kernel.
Assuming a zero mean GP, each kernel function defines a prior distribution. Following this, Fig. 2 depicts sample functions drawn from each prior distribution, again showing the parallels between the kernels.
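Drawing such sample functions amounts to sampling from a multivariate Gaussian with zero mean and the Curve Gram matrix as covariance. A sketch, with arbitrarily chosen control-point variances and a small jitter term added for numerical stability (the jitter is an implementation detail, not part of the model):

```python
import numpy as np
from math import comb

def curve_gram(ts, variances):
    # Gram matrix of the zero-mean Curve GP prior: K = B diag(sigma^2) B^T,
    # with B[j, i] = b_{i,N}(t_j) (equivalent to evaluating Eq. 7 pairwise).
    n = len(variances) - 1
    B = np.array([[comb(n, i) * t**i * (1.0 - t)**(n - i)
                   for i in range(n + 1)] for t in ts])
    return B @ np.diag(variances) @ B.T

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 50)
K = curve_gram(ts, [1.0, 0.6, 0.6, 0.6, 1.0]) + 1e-9 * np.eye(len(ts))
samples = rng.multivariate_normal(np.zeros(len(ts)), K, size=3)
```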
As a final note, the Curve GP is non-stationary, as its kernel function depends on the actual values of its inputs $t_1$ and $t_2$. Further, it is non-periodic, and its smoothness is controlled by the underlying Bézier curve, i.e. the position and number of control points.
3.2 Multivariate Gaussian Processes
Multivariate GPs target vector-valued functions $f: \mathbb{R} \rightarrow \mathbb{R}^d$, which map scalar inputs onto $d$-dimensional vectors. In this case, there exist two closely related approaches. The first revolves around the matrix normal distribution [8, 9]. The other sticks with the multivariate Gaussian distribution and models vector-valued functions by using stacked mean vectors in combination with block-partitioned covariance matrices [1]. Their relationship stems from the fact that a matrix normal distribution can be transformed into a multivariate Gaussian distribution by vectorizing, i.e. stacking, the mean matrix vectors and calculating the covariance matrix as the Kronecker product of both scale matrices. The second approach is considered in the following, granting a more straightforward extension of the univariate case. Following this, the Gram matrix of a $d$-variate GP for a finite index subset $\{t_1, \ldots, t_M\}$ with $t_1 < \cdots < t_M$ is given by the block-partitioned matrix

$K = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_M) \\ \vdots & \ddots & \vdots \\ k(t_M, t_1) & \cdots & k(t_M, t_M) \end{pmatrix}$  (10)

calculated using the matrix-valued kernel function $k(\cdot, \cdot)$. Here, $X_{t_1}$ and $X_{t_2}$ are now $d$-variate Gaussian random variables resulting from the linear combination of $d$-variate Curve control points $P_i \sim \mathcal{N}(\mu_i, \Sigma_i)$. The kernel function

$k(t_1, t_2) = \sum_{i=0}^{N} b_{i,N}(t_1)\, b_{i,N}(t_2)\, \Sigma_i$  (11)

is thus the multivariate generalization of the function given in Eq. 7 and yields a $d \times d$ matrix. The mean vector is defined as the concatenation of all point means

$\boldsymbol{\mu} = (\mu(t_1)^\top, \ldots, \mu(t_M)^\top)^\top,$  (12)

where $\mu(\cdot)$ is the mean function according to Eq. 2 in the Curve definition.
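The block-partitioned construction can be sketched as follows, assuming two bivariate control points with placeholder covariances:

```python
import numpy as np
from math import comb

def matrix_kernel(t1, t2, Sigmas):
    # Matrix-valued kernel k(t1, t2) = sum_i b_i(t1) b_i(t2) Sigma_i  (Eq. 11)
    n = len(Sigmas) - 1
    b = lambda i, t: comb(n, i) * t**i * (1.0 - t)**(n - i)
    return sum(b(i, t1) * b(i, t2) * S for i, S in enumerate(Sigmas))

def block_gram(ts, Sigmas):
    # Block-partitioned Gram matrix (Eq. 10) of a d-variate Curve GP
    d = Sigmas[0].shape[0]
    K = np.zeros((len(ts) * d, len(ts) * d))
    for a, s in enumerate(ts):
        for c, t in enumerate(ts):
            K[a*d:(a+1)*d, c*d:(c+1)*d] = matrix_kernel(s, t, Sigmas)
    return K

# Two bivariate control points with arbitrary example covariances
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
K = block_gram(np.linspace(0.0, 1.0, 3), Sigmas)
```

At $t_1 = t_2 = 0$, only the first control point contributes, so the top-left block equals $\Sigma_0$.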
3.3 Multimodal Gaussian Processes
With sequence modeling tasks often being multimodal problems, and GPs as presented before being incapable of modeling such data, multimodal GPs are considered as a final case. A common approach to increasing the expressiveness of a statistical model, e.g. for heteroscedasticity or multimodality, is given by mixture modeling. Here, rather than a single model or distribution, a mixture of several is used, with each component in the mixture covering a subset of the data. Generally speaking, a widely used mixture model is the Gaussian mixture model [3], which is defined as a convex combination of $K$ Gaussian distributions with mixing weights $\pi_k$ and probability density function

$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k).$  (13)
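For illustration, the density in Eq. 13 for a univariate two-component mixture can be evaluated as follows (all parameter values are arbitrary illustration values):

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    # Convex combination of univariate Gaussian densities (Eq. 13)
    x, m, v = float(x), np.asarray(means), np.asarray(variances)
    comps = np.exp(-0.5 * (x - m)**2 / v) / np.sqrt(2.0 * np.pi * v)
    return float(np.dot(weights, comps))

# Bimodal example with modes at -1 and 1
p = gmm_pdf(0.0, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0])
```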
In the case of GPs, a popular approach is given by the mixture of Gaussian process experts [37, 34, 39], which extends the mixture of experts model [21]. In this approach, the mixture model is comprised of a mixture of $K$ GP experts (components)

$f_k \sim \mathcal{GP}(m_k, k_k), \quad k \in \{1, \ldots, K\},$  (14)

with mean functions $m_k$ and kernel functions $k_k$, weighted using a conditional mixing weight distribution for a given sample. The weight distribution is generated by a gating network, which decides on the influence of each local expert for modeling a given sample. This is the key difference to the Gaussian mixture model, where the mixing weight distribution is determined a priori (e.g. via EM [12] or an MDN [4]). It can be noted that the mixture of experts approach is also oftentimes used to lower the computational load of a GP model, as fewer data points have to be considered during inference due to the use of local experts (e.g. [11, 24]).
In line with the mixture of Curves approach given in [18, 19], which builds on Gaussian mixture models, the multimodal extension of the Curve GP is defined as a mixture of GPs

$p(f) = \sum_{k=1}^{K} \pi_k\, \mathcal{GP}_k(f),$  (15)

where $\mathcal{GP}_k$ are the local GPs and $\{\pi_k\}$ with $\sum_{k} \pi_k = 1$ is the prior distribution over the mixing weights. This approach yields a GP prior which is already adapted towards the data when setting the Curve parameters accordingly.
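Drawing a sample function from such a mixture prior amounts to ancestral sampling: first a component index is drawn from the mixing weights, then a function is sampled from that component's GP. A sketch with placeholder means and covariances (not MDN outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
M = 20  # number of modeled time steps
components = [
    (np.zeros(M), np.eye(M)),        # local GP 1: (mean, covariance)
    (np.ones(M), 0.5 * np.eye(M)),   # local GP 2
]
pi = np.array([0.7, 0.3])            # prior mixing weights, sum to 1

# Ancestral sampling: component index first, then the function values
k = rng.choice(len(components), p=pi)
f = rng.multivariate_normal(*components[k])
```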
4 Experiments
In this section, the Curve GP model is examined using human trajectory prediction as an exemplary sequence prediction task. This task provides results that are easy to interpret and visualize, while also offering substantial complexity, being a highly multimodal problem despite its low data dimensionality. In human trajectory prediction, given points of a trajectory as input, a sequence model is tasked with predicting the subsequent trajectory points. This section focuses on the examination of the GP model as a refinement component and omits a state-of-the-art comparison, as the underlying Curve model has been proven competitive on the given task in [19]. For this, the GP posterior distribution is calculated for different observed sequence points, which is expected to adapt the initial MDN prediction towards the actual test sample.
4.1 Parameter Estimation and Conditional Inference
For estimating the parameters of the GP prior distribution, an MDN is used, which maps an input vector onto the parameters of a Curve mixture. In the context of sequence prediction, a common approach is to use a (recurrent) sequence encoder, such as an LSTM [17], for encoding an input sequence into a vector representation, which is then passed through the MDN. For training the MDN using a set $\mathcal{D}$ of fixed-length trajectories, the negative log-likelihood loss function

$\mathcal{L} = -\sum_{\mathcal{X} \in \mathcal{D}} \log p(\mathcal{X})$  (16)

can be applied in conjunction with a gradient descent policy [18, 19], where $p(\mathcal{X})$ denotes the likelihood of trajectory $\mathcal{X}$ under the Curve mixture. Although an MDN merely provides a point estimate of the underlying data-generating distribution, this is no major drawback in the context of GPs, as it solely serves the purpose of estimating a GP prior distribution, which can then be adapted towards the data.
After training, the MDN can be used for generating maximum likelihood estimates of an input trajectory segment in combination with possible future trajectories, defined in terms of a Curve mixture. Using Equations 11, 12 and 15, this mixture additionally provides the GP prior distribution. In the context of this experimental section, a subset of the input is then used for calculating the GP posterior in order to tailor the mean prediction generated by the MDN to a given test trajectory. As the prior is a joint Gaussian mixture distribution over all modeled time steps, it can be partitioned into the time steps to condition on and the remaining time steps. Following this, a closed-form solution exists for the posterior weights, mean vectors and covariance matrices (see e.g. [3, 31]). The probability distribution for each individual trajectory point can then be extracted through marginalization.
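The conditioning step can be sketched for a single Gaussian component; the joint prior below is a toy example rather than an MDN output, and the posterior-weight update of the full mixture is omitted:

```python
import numpy as np

def condition_gaussian(mu, K, obs_idx, obs_val):
    # Posterior over the unobserved entries of a joint Gaussian, given
    # observed entries (standard conditioning formula, see e.g. [3, 31]).
    idx = np.arange(len(mu))
    b = np.asarray(obs_idx)
    a = np.setdiff1d(idx, b)  # unobserved indices
    Kab = K[np.ix_(a, b)]
    Kbb = K[np.ix_(b, b)]
    sol = np.linalg.solve(Kbb, np.asarray(obs_val) - mu[b])
    mu_post = mu[a] + Kab @ sol
    K_post = K[np.ix_(a, a)] - Kab @ np.linalg.solve(Kbb, Kab.T)
    return mu_post, K_post

# Toy prior over three time steps (illustrative values, not from the MDN)
mu = np.zeros(3)
K = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.8],
              [0.5, 0.8, 1.0]])
mu_post, K_post = condition_gaussian(mu, K, obs_idx=[0], obs_val=[1.0])
```

Conditioning pulls the posterior mean of the remaining time steps towards the observation (here to $(0.8, 0.5)$) and shrinks their variances, mirroring the refinement behavior described above.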
4.2 Experimental Setup
For this evaluation, scenes from commonly used datasets are considered: BIWI Walking Pedestrians ([30], scenes: ETH and Hotel), Crowds by Example ([25], scenes: Zara1 and Zara2) and the Stanford Drone Dataset ([35], scenes: Bookstore and Hyang). Following common practice, the annotation frequency of each dataset is adjusted to the same number of annotations per second. Further, the evaluation is conducted on trajectories of fixed length. The MDN is trained independently on each dataset to generate Curve mixture estimates for complete trajectories. These initial predictions are then refined by calculating the posterior predictive distributions conditioned on the last input trajectory point (posterior A) and on multiple observed points (posterior B). Using an increasing number of observations is expected to adapt the prediction towards a given trajectory sample. For measuring the performance, the Average Displacement Error (ADE, [30, 23]) is applied according to the standard evaluation approach, using a maximum likelihood estimate. As the ADE does not provide an adequate measure for assessing the quality of (multimodal) probabilistic predictions, the Negative (data) Log-Likelihood (NLL), a common choice [2, 20], is used in addition to the ADE.
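The ADE itself reduces to a mean of pointwise Euclidean distances between predicted and ground-truth trajectory points; a minimal sketch with made-up coordinates:

```python
import numpy as np

def ade(pred, gt):
    # Average Displacement Error: mean Euclidean distance between predicted
    # and ground-truth trajectory points.
    d = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)
    return float(np.linalg.norm(d, axis=-1).mean())

# Both predicted points are off by one unit, so the ADE is 1.0
err = ade([[0.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]])
```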
Scene      Metric   Prior           Posterior A     Posterior B
ETH        ML-ADE   3.85 / 11.25    3.95 / 10.12    2.39 / 10.18
           NLL      6.51 / 7.58     5.43 / 7.08     115.09 / 1.70
Hotel      ML-ADE   5.69 / 17.96    4.19 / 17.07    2.70 / 16.73
           NLL      6.99 / 8.20     5.56 / 7.71     10.63 / 8.59
Zara1      ML-ADE   4.09 / 19.10    2.89 / 17.52    1.63 / 17.64
           NLL      6.83 / 8.18     5.27 / 7.63     51.93 / 8.15
Zara2      ML-ADE   2.98 / 21.38    2.64 / 20.07    1.69 / 20.05
           NLL      6.59 / 8.09     5.08 / 7.59     60.95 / 1.76
Bookstore  ML-ADE   4.04 / 17.21    3.65 / 15.97    2.16 / 16.29
           NLL      7.46 / 8.37     5.88 / 7.76     11.89 / 7.63
Hyang      ML-ADE   5.51 / 36.46    5.01 / 34.05    3.16 / 32.18
           NLL      8.21 / 9.42     6.65 / 8.86     49.30 / 9.05
4.3 Results
The quantitative results with respect to the selected performance measures are depicted in Table 1. Overall, an increase in performance can be observed when refining the estimate generated by the regression-based neural network using one and multiple observed points, respectively. This holds for both the input portion and the portion to predict of the modeled trajectory. Two examples highlighting common cases of a positive effect of the refinement on the estimate are given in Fig. 3. On the one hand, the refinement can pull the estimate closer to the ground truth in the input portion, an effect which extends far into the future prediction (first row). On the other hand, the refinement can lead to the suppression of inadequate mixture components in the posterior predictive distribution which had high weights assigned to them in the prior distribution (second row).
Besides the overall performance, it can be seen that in some instances conditioning on multiple observations (posterior distribution B) degrades the performance in comparison to using a single observation (posterior distribution A). With respect to the NLL, this can be attributed to an increased number of trajectory point variances decreasing or even collapsing, whereby trajectory points closer to the observed points are more affected. In this case, even minor inaccuracies in the mean prediction result in higher NLL values, even if the estimate is closer to the ground truth. Looking at the ADE, the loss in performance can most likely be attributed to the enforced interpolation of the observed points leading to unwanted deformations of the mean prediction. One of the main causes for this is that input trajectories are commonly subject to noise. It should be noted that a common approach for dealing with such problems is to add an error term to each observed point [15], which introduces additional hyperparameters. Examples for both of these cases are depicted in Fig. 4.

Apart from prediction refinement, the GP framework can be used in cases where new measurements appear within the predicted time horizon. These can be used to update the prediction after the fact, without having to generate a new prediction using the MDN. This is especially valuable, as the MDN, like common prediction models, requires complete trajectories as input, which would have to be extracted from the model's initial prediction and would thus, in the worst case, not be backed by any real measured data. Related to this, receiving a new measurement after several time steps without a measurement grants the opportunity to make an informed estimate when using the GP framework for closing such gaps. An example of the posterior predictive distribution given an additional observation within the prediction time horizon is depicted in Fig. 5. While there are initially multiple relevant mixture components (according to their weighting), the additional observation leads to the suppression of wrong modes.
5 Summary
Throughout this paper, a hybrid regression-based and Bayesian probabilistic sequence model has been presented. The model builds upon a special case of Gaussian processes derived from probabilistic Bézier curves generated by a Mixture Density Network. It therefore allows for Bayesian conditional inference without the need for variational approximation or Monte Carlo simulation. The viability of the model was examined in a trajectory forecasting setting, where the GP framework was applied in order to refine predictions generated by the regression network. Throughout these experiments, an increase in prediction performance could be observed. Further, the model allows for prediction updates within the predicted time horizon when new observations are given, without the need for performing another pass through the Mixture Density Network.
References
 [1] (2012) Kernels for vector-valued functions: a review. Found. Trends Mach. Learn. 4 (3), pp. 195–266. External Links: ISSN 1935-8237, Link, Document Cited by: §3.2.

 [2] (2018) Accurate and diverse sampling of sequences based on a “best of many” sample objective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8485–8493. Cited by: §4.2.
 [3] (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA. External Links: ISBN 0387310738 Cited by: §3.3, §4.1.
 [4] (1994) Mixture density networks. Cited by: §1, §3.3.
 [5] (1995) Neural networks for pattern recognition. Oxford university press. Cited by: §1.
 [6] (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §1.
 [7] (2016) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 10–21. External Links: Link, Document Cited by: §1.
 [8] (2020) Remarks on multivariate gaussian process. arXiv preprint arXiv:2010.09830. Cited by: §3.2.
 [9] (2020) Multivariate gaussian and student-t process regression for multi-output prediction. Neural Computing and Applications 32 (8), pp. 3005–3028. Cited by: §3.2.
 [10] (2013) Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215. Cited by: §1.
 [11] (2015) Distributed gaussian processes. In International Conference on Machine Learning, pp. 1481–1490. Cited by: §3.3.
 [12] (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22. Cited by: §3.3.

 [13] (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §1.
 [14] (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680. Cited by: §1.
 [15] (2019) A visual exploration of gaussian processes. Distill 4 (4), pp. e17. Cited by: §3.1, §4.3.

 [16] (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1.
 [17] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
 [18] (2020) Introducing probabilistic bézier curves for n-step sequence prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10162–10169. Cited by: §1, §2.2, §3.3, §4.1.
 [19] (2022) Probabilistic parametric curves for sequence modeling. Ph.D. Thesis, Karlsruher Institut für Technologie (KIT), Karlsruher Institut für Technologie (KIT), (english). External Links: Document Cited by: §1, §2.2, §3.3, §3, §4.1, §4.
 [20] (2019) The trajectron: probabilistic multiagent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2375–2384. Cited by: §4.2.
 [21] (1991) Adaptive mixtures of local experts. Neural computation 3 (1), pp. 79–87. Cited by: §3.3.
 [22] (2014) Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings, External Links: Link Cited by: §1.
 [23] (2021) Human trajectory forecasting in crowds: a deep learning perspective. IEEE Transactions on Intelligent Transportation Systems. Cited by: §4.2.
 [24] (2021) Gaussian process-based real-time learning for safety critical applications. In International Conference on Machine Learning, pp. 6055–6064. Cited by: §3.3.
 [25] (2007) Crowds by example. In Computer graphics forum, Vol. 26, pp. 655–664. Cited by: §4.2.
 [26] (2013) Bernstein polynomials. American Mathematical Soc.. Cited by: §2.2.
 [27] (2003) Information theory, inference and learning algorithms. Cambridge university press. Cited by: §3.
 [28] (2016) Recurrent gaussian processes. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §1.
 [29] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1.
 [30] (2009) You’ll never walk alone: modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision, pp. 261–268. Cited by: §4.2.
 [31] (2008) The matrix cookbook. Technical University of Denmark 7 (15), pp. 510. Cited by: §4.1.
 [32] (2002) Bézier and b-spline techniques. Springer Science & Business Media. Cited by: §2.2.
 [33] (2006) Gaussian processes for machine learning. Adaptive computation and machine learning, MIT Press. External Links: ISBN 026218253X Cited by: §1, §1, §2.1, §3.
 [34] (2001) Infinite mixtures of gaussian process experts. Advances in neural information processing systems 14. Cited by: §3.3.
 [35] (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In European conference on computer vision, pp. 549–565. Cited by: §4.2.
 [36] (2015) Learning structured output representation using deep conditional generative models. Advances in neural information processing systems 28, pp. 3483–3491. Cited by: §1.
 [37] (2000) Mixtures of gaussian processes. Advances in neural information processing systems 13. Cited by: §3.3.
 [38] (2017) Seqgan: sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI conference on artificial intelligence, Vol. 31. Cited by: §1.
 [39] (2008) Variational mixture of gaussian process experts. Advances in neural information processing systems 21. Cited by: §3.3.