
Bézier Curve Gaussian Processes

05/03/2022
by Ronny Hug, et al., Fraunhofer

Probabilistic models for sequential data are the basis for a variety of applications concerned with processing temporally ordered information. The predominant approach in this domain is given by neural networks, which incorporate stochastic units or components. This paper proposes a new probabilistic sequence model building on probabilistic Bézier curves. Using Gaussian distributed control points, these parametric curves constitute a special case of Gaussian processes (GP). Combined with a Mixture Density Network, Bayesian conditional inference can be performed without the need for mean field variational approximation or Monte Carlo simulation, which is a requirement of common approaches. To assess the viability of this hybrid model, it is applied to an exemplary sequence prediction task, namely pedestrian trajectory prediction, where a generated prediction also serves as a GP prior. The initial prediction can then be refined within the GP framework by calculating different posterior distributions, in order to adapt it towards a given observed trajectory segment.


1 Introduction

Models of sequential data play an integral role in a range of different applications related to representation learning, sequence synthesis and prediction. As real-world data is often subject to noise and detection or annotation errors, probabilistic sequence models are favorable: they take uncertainty in the data into account and provide an implicit or explicit representation of the underlying probability distribution.

The determination of such a probabilistic sequence model is commonly laid out as a learning problem: a model of an unknown underlying stochastic process is learned from given sample sequences. Common approaches are based on either Gaussian processes [33] (e.g. [10, 28]) or, more prevalently, on neural networks, i.e. approximate Bayesian neural models (e.g. Bayesian Neural Networks [5, 6, 13], Variational Autoencoders [22, 36, 7] and Generative Adversarial Networks [14, 29, 38]) or regression-based neural models building on Mixture Density Networks (MDN) [4] (e.g. [16]). Approximate Bayesian neural models contain stochastic components and allow sampling from the modeled probability distribution. These models typically require Monte Carlo simulation or variational approximation during training and inference. MDNs, on the other hand, are deterministic models which parameterize a mixture distribution, making them potentially less expressive as probabilistic models, as the model merely learns to generate point estimates of the target distribution. Although MDNs are more stable and less computationally demanding during training, Monte Carlo simulation is still required for multi-modal inference. A potential drawback common to both approaches is that sequences are usually processed and generated in an iterative fashion, so outliers occurring during Monte Carlo simulation may negatively affect all subsequent time steps.

In order to tackle difficulties with multi-modal inference, Hug et al. [18, 19] proposed a variation of MDNs which operates in the domain of parametric curves instead of the data domain, allowing multiple time steps to be inferred in a single inference step. The model is built on a probabilistic extension of Bézier curves (N-Curves), which themselves constitute a special case of Gaussian processes (GP, [33]). This paper aims to exploit this relationship in order to reintroduce some of the expressiveness without losing the advantages of the regression-based model. This is done by deriving the Gaussian process induced by an N-Curve (mixture) as generated by the MDN. In terms of the GP, the MDN then generates its prior distribution, from which different posterior predictive distributions can be calculated in the presence of data. This can, for example, be used to refine a prediction generated by the MDN in a sequence prediction task, which is the use case considered for evaluating the viability of the combined model. Following this, the main contributions of this paper are:

  • The derivation of a new mean and covariance function for a GP derived from N-Curves, covering the multivariate ($D$-variate) as well as the multi-modal case.

  • A combination of regression-based probabilistic sequence prediction and GP-based prediction refinement.

2 Preliminaries

2.1 Gaussian Processes

A Gaussian process (GP, [33]) is a stochastic process $\{X_t\}_{t \in T}$ with index set $T$, where the joint distribution of any finite collection of the stochastic variables $X_t$ is a multivariate Gaussian distribution. For simplicity, the index set will be interpreted as time throughout this paper. The joint distribution is obtained using an explicit mean function $m(t)$ and a positive definite covariance function $k(t_1, t_2)$, commonly referred to as the kernel of the Gaussian process, and yields a multivariate Gaussian prior probability distribution over function space. Commonly, $m(t) = 0$ is assumed. Given a collection of sample points of a function $f$, the posterior (predictive) distribution modeling non-observed function values can be obtained through conditioning. As such, Gaussian processes provide a well-established model for probabilistic sequence modeling.
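
As a brief illustration of such conditioning, the following is a minimal sketch of the noise-free GP posterior under a zero-mean prior with an RBF kernel; the function names, kernel choice and hyperparameter values are illustrative and not taken from the paper.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2, variance=1.0):
    """RBF covariance between all pairs of scalar inputs in a and b."""
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)


def gp_posterior(t_obs, y_obs, t_new, kernel=rbf_kernel, jitter=1e-8):
    """Posterior mean and covariance at t_new given noise-free observations (t_obs, y_obs)."""
    K = kernel(t_obs, t_obs) + jitter * np.eye(len(t_obs))
    K_star = kernel(t_new, t_obs)
    K_ss = kernel(t_new, t_new)
    mean = K_star @ np.linalg.solve(K, y_obs)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov


# Illustrative usage with three observed function values.
t_obs = np.array([0.0, 0.5, 1.0])
y_obs = np.array([0.0, 1.0, 0.5])
t_new = np.linspace(0.0, 1.0, 50)
posterior_mean, posterior_cov = gp_posterior(t_obs, y_obs, t_new)
```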

2.2 Probabilistic Bézier Curves

Probabilistic Bézier Curves (N-Curves, [18, 19]) are Bézier curves [32] defined by $N + 1$ independent $D$-dimensional Gaussian control points $P_i \sim \mathcal{N}(\mu_i, \Sigma_i)$ with $i \in \{0, \dots, N\}$. Through the curve construction function

$X_t \sim \mathcal{N}(\mu(t), \Sigma(t)), \quad t \in [0, 1],$ (1)

with

$\mu(t) = \sum_{i=0}^{N} b_{i,N}(t)\, \mu_i$ (2)

and

$\Sigma(t) = \sum_{i=0}^{N} \big(b_{i,N}(t)\big)^2\, \Sigma_i,$ (3)

where $b_{i,N}(t) = \binom{N}{i} t^i (1 - t)^{N - i}$ are the Bernstein polynomials [26], the stochasticity is passed from the control points to the curve points $X_t$, yielding a sequence of Gaussian distributions along the underlying Bézier curve. Thus, a stochastic process with index set $[0, 1]$ can be defined. For representing discrete data, i.e. sequences of length $L$, a discrete subset of $[0, 1]$ can be employed for connecting sequence indices $j \in \{1, \dots, L\}$ with evenly distributed values in $[0, 1]$, yielding

$T = \big\{ t_j = \tfrac{j - 1}{L - 1} \;\big|\; j \in \{1, \dots, L\} \big\}.$ (4)
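
To make the construction above concrete, the following minimal sketch evaluates the per-point mean and covariance (Eqs. 2 and 3) for evenly spaced curve parameters (Eq. 4); NumPy/SciPy are assumed, and the function names and control point values are illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.special import comb


def bernstein(i, n, t):
    """Bernstein polynomial b_{i,n}(t) for t in [0, 1]."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)


def n_curve_point(mus, sigmas, t):
    """Mean and covariance of the curve point X_t (Eqs. 2 and 3).

    mus:    (N+1, D) array of control point means
    sigmas: (N+1, D, D) array of control point covariance matrices
    """
    n = mus.shape[0] - 1
    b = np.array([bernstein(i, n, t) for i in range(n + 1)])
    mean = (b[:, None] * mus).sum(axis=0)                # sum_i b_i(t) * mu_i
    cov = (b[:, None, None] ** 2 * sigmas).sum(axis=0)   # sum_i b_i(t)^2 * Sigma_i
    return mean, cov


# Evenly spaced curve parameters for a sequence of length L (Eq. 4).
L = 5
ts = np.linspace(0.0, 1.0, L)

# Four 2-dimensional Gaussian control points (values chosen purely for illustration).
mus = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]])
sigmas = np.stack([0.1 * np.eye(2) for _ in range(4)])

marginals = [n_curve_point(mus, sigmas, t) for t in ts]
```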

3 N-Curve Gaussian Processes

With N-Curves providing a representation for stochastic processes comprised of Gaussian random variables $X_t$, it can be shown that N-Curves are a special case of GPs using an implicit covariance function. Following the definition of GPs [27, 33], an N-Curve can be classified as a GP if, for any finite subset $\{t_1, \dots, t_n\}$ of $[0, 1]$, the joint probability density of the corresponding random variables $X_{t_1}, \dots, X_{t_n}$ is Gaussian. This property is referred to as the GP property and can be shown to hold true by reformulating the curve construction formula into a linear transformation¹ of the Gaussian control points stacked into a vector

$\mathbf{P} = \big(P_0^\top, \dots, P_N^\top\big)^\top$ (5)

using a transformation matrix

$M = \big(b_{i,N}(t_j)\, I_D\big)_{j \in \{1, \dots, n\},\, i \in \{0, \dots, N\}}$ (6)

determined by the Bernstein polynomials, with $\mathbf{X} = (X_{t_1}^\top, \dots, X_{t_n}^\top)^\top = M \mathbf{P}$ [19]. As $\mathbf{P}$ is itself a Gaussian random vector, $\mathbf{X}$ is again Gaussian, with its corresponding probability density function being a Gaussian probability density.

¹ For clarity, multivariate random variables may be written in bold font occasionally.

As the Gaussians along an N-Curve are correlated through the use of common control points and the way the curve is constructed, the kernel function can also be given explicitly. The following sections provide mean and kernel functions for the univariate, multivariate and multi-modal case.

3.1 Univariate Gaussian Processes

Univariate GPs targeting scalar-valued functions are considered first, as they are the most common use case. Further, this grants a simple case for deriving the mean and kernel functions induced by a given N-Curve, while also allowing a visual examination of some properties of the induced GP, denoted as N-GP in the following. Here, the stochastic control points are thus defined by their mean values $\mu_i$ and variances $\sigma_i^2$. The mean function is equivalent to Eq. 2. Thus, this section focuses on the kernel function for two curve points $X_{t_1}$ and $X_{t_2}$ at indices $t_1$ and $t_2$ with $t_1, t_2 \in [0, 1]$. The respective mean values are given by $\mu(t_1)$ and $\mu(t_2)$. It now follows:

$k(t_1, t_2) = \operatorname{Cov}(X_{t_1}, X_{t_2}) = \sum_{i=0}^{N} \sum_{l=0}^{N} b_{i,N}(t_1)\, b_{l,N}(t_2)\, \operatorname{Cov}(P_i, P_l).$

With $\operatorname{Cov}(P_i, P_l) = 0$ for $i \neq l$, which follows from the independence of the control points, and $\operatorname{Cov}(P_i, P_i) = \sigma_i^2$, follows the closed-form solution

$k(t_1, t_2) = \sum_{i=0}^{N} b_{i,N}(t_1)\, b_{i,N}(t_2)\, \sigma_i^2.$ (7)
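
As an illustration, the following sketch evaluates this kernel (Eq. 7) and builds the corresponding Gram matrix over evenly spaced inputs; the function names and the control point standard deviations are illustrative placeholders.

```python
import numpy as np
from scipy.special import comb


def bernstein(i, n, t):
    """Bernstein polynomial b_{i,n}(t) for t in [0, 1]."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)


def n_gp_kernel(t1, t2, stds):
    """Univariate N-GP kernel: k(t1, t2) = sum_i b_{i,N}(t1) b_{i,N}(t2) sigma_i^2 (Eq. 7)."""
    n = len(stds) - 1
    return sum(bernstein(i, n, t1) * bernstein(i, n, t2) * s**2
               for i, s in enumerate(stds))


# Gram matrix over equally spaced inputs in [0, 1], analogous to Fig. 1.
ts = np.linspace(0.0, 1.0, 20)
stds = [1.0, 1.0]  # two unit-variance control points, i.e. the simple kernel discussed below
gram = np.array([[n_gp_kernel(a, b, stds) for b in ts] for a in ts])
```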

Considering the kernel function and the resulting Gram matrix, the N-GP is heavily dependent on the given set of control points, allowing for a range of different kernels. Fig. 1 illustrates a radial basis function (RBF, [15]) kernel

$k_{\mathrm{RBF}}(t_1, t_2) = \sigma^2 \exp\!\left(-\frac{(t_1 - t_2)^2}{2\ell^2}\right)$ (8)

with variance $\sigma^2$ and length scale $\ell$, a linear kernel [15]

$k_{\mathrm{lin}}(t_1, t_2) = \sigma_b^2 + \sigma^2 (t_1 - c)(t_2 - c)$ (9)

with offset $c$, and two N-GP kernels $k_1$ and $k_2$. $k_1$ is induced by an N-Curve consisting of two unit Gaussian control points, while $k_2$ is induced by an N-Curve consisting of several zero mean Gaussian control points with varying standard deviations. Here, the standard deviations vary in order to cope with the non-linear blending of the control points (see Eq. 3). Fig. 1 depicts the Gram matrix calculated from equally spaced values in $[0, 1]$ for each kernel.

Figure 1: Gram matrices for equally spaced values in $[0, 1]$ obtained by using different GP kernels. Left to right: RBF kernel, linear kernel and N-GP kernels $k_1$ and $k_2$.

When comparing the Gram matrices, it can be seen that the Gram matrix calculated with $k_1$ is equal to that calculated with the linear kernel when normalizing its values to $[0, 1]$. On the other hand, the Gram matrix obtained with $k_2$, which is derived from a more complex N-Curve, tends to be more comparable to the Gram matrix calculated with the RBF kernel.

Assuming a zero mean GP, each kernel function defines a prior distribution. Following this, Fig. 2 depicts sample functions drawn from each prior distribution, again showing the parallels between the kernels.

Figure 2: Samples drawn from prior distributions using different GP kernels. The standard deviation region is depicted as a red shaded area. Left to right: RBF kernel, linear kernel and N-GP kernels $k_1$ and $k_2$.

As a final note, the N-GP is non-stationary, as its kernel function depends on the actual values of its inputs $t_1$ and $t_2$ rather than only on their difference. Further, it is non-periodic and its smoothness is controlled by the underlying Bézier curve, i.e. the position and number of control points.

3.2 Multivariate Gaussian Processes

Multivariate GPs target vector-valued functions $f: \mathbb{R} \to \mathbb{R}^D$, which map scalar inputs onto $D$-dimensional vectors, e.g. 2-dimensional positions in the case of trajectories. In this case, there exist two closely related approaches. The first revolves around the matrix normal distribution [8, 9]. The other sticks with the multivariate Gaussian distribution and models vector-valued functions by using stacked mean vectors in combination with block partitioned covariance matrices [1]. Their relationship stems from the fact that a matrix normal distribution can be transformed into a multivariate Gaussian distribution by vectorizing, i.e. stacking, the columns of the mean matrix and calculating the covariance matrix as the Kronecker product of both scale matrices.

The second approach is considered in the following, as it grants a more straightforward extension of the univariate case. Following this, the Gram matrix of a $D$-variate GP for a finite index subset $T = \{t_1, \dots, t_L\}$ with $|T| = L$ is given by the block partitioned matrix

$K = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_L) \\ \vdots & \ddots & \vdots \\ k(t_L, t_1) & \cdots & k(t_L, t_L) \end{pmatrix}$ (10)

calculated using the matrix-valued kernel function $k(t_1, t_2)$. Here, $X_{t_1}$ and $X_{t_2}$ are now $D$-variate Gaussian random variables resulting from the linear combination of $D$-variate N-Curve control points $P_i \sim \mathcal{N}(\mu_i, \Sigma_i)$. The kernel function

$k(t_1, t_2) = \sum_{i=0}^{N} b_{i,N}(t_1)\, b_{i,N}(t_2)\, \Sigma_i$ (11)

is thus the multivariate generalization of the function given in Eq. 7 and yields a $D \times D$ matrix.

The mean vector is defined as the concatenation of all point means

$\boldsymbol{\mu} = \big(\mu(t_1)^\top, \dots, \mu(t_L)^\top\big)^\top,$ (12)

where $\mu(\cdot)$ is the mean function according to Eq. 2 in the N-Curve definition.
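
The following sketch assembles the stacked mean vector (Eq. 12) and the block partitioned Gram matrix (Eqs. 10 and 11) for given control point parameters; it reuses the illustrative Bernstein helper from the earlier sketches and is not the authors' implementation.

```python
import numpy as np
from scipy.special import comb


def bernstein(i, n, t):
    """Bernstein polynomial b_{i,n}(t) for t in [0, 1]."""
    return comb(n, i) * t**i * (1.0 - t)**(n - i)


def n_gp_mean_and_gram(mus, sigmas, ts):
    """Stacked mean vector (Eq. 12) and block partitioned Gram matrix (Eqs. 10, 11).

    mus:    (N+1, D) control point means
    sigmas: (N+1, D, D) control point covariance matrices
    ts:     curve parameters t_1, ..., t_L in [0, 1]
    """
    n, d = mus.shape[0] - 1, mus.shape[1]
    B = np.array([[bernstein(i, n, t) for i in range(n + 1)] for t in ts])  # (L, N+1)
    mean = np.concatenate([B[j] @ mus for j in range(len(ts))])             # length L * D
    gram = np.zeros((len(ts) * d, len(ts) * d))
    for j in range(len(ts)):
        for m in range(len(ts)):
            # matrix-valued kernel block: sum_i b_i(t_j) b_i(t_m) Sigma_i (Eq. 11)
            block = sum(B[j, i] * B[m, i] * sigmas[i] for i in range(n + 1))
            gram[j * d:(j + 1) * d, m * d:(m + 1) * d] = block
    return mean, gram


# Illustrative usage with four 2-dimensional control points.
mus = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 2.0], [4.0, 0.0]])
sigmas = np.stack([0.1 * np.eye(2) for _ in range(4)])
mean, gram = n_gp_mean_and_gram(mus, sigmas, np.linspace(0.0, 1.0, 5))
```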

3.3 Multi-modal Gaussian Processes

With sequence modeling tasks often being multi-modal problems and GPs as presented before being incapable of modeling such data, multi-modal GPs are considered as a final case. A common approach to increasing the expressiveness of a statistical model, e.g. for handling heteroscedasticity or multi-modality, is given by mixture modeling. Thereby, rather than a single model or distribution, a mixture of models or distributions is used, with each component in the mixture covering a subset of the data. Generally speaking, a widely used mixture model is given by the Gaussian mixture model [3], which is defined as a convex combination of $K$ Gaussian distributions with mixing weights $\pi_k$ and probability density function

$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k).$ (13)

In the case of GPs, a popular approach is given by the mixture of Gaussian process experts [37, 34, 39], which extends the mixture of experts model [21]. In this approach, the mixture model is comprised of $K$ GP experts (components) with mean functions $m_k$ and kernel functions $k_k$,

$p(f \mid x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{GP}\big(f \mid m_k, k_k\big),$ (14)

weighted using a conditional mixing weight distribution $\pi_k(x)$ for a given sample $x$. The weight distribution is generated by a gating network, which decides on the influence of each local expert for modeling a given sample. This is the key difference to the Gaussian mixture model, where the mixing weight distribution is determined a priori (e.g. via EM [12] or an MDN [4]). It can be noted that the mixture of experts approach is also oftentimes used to lower the computational load of a GP model, as fewer data points have to be considered during inference due to the use of local experts (e.g. [11, 24]).

In line with the mixture of N-Curves approach given in [18, 19], which builds on Gaussian mixture models, the multi-modal extension of the N-GP is defined as a mixture of N-GPs

$p(f) = \sum_{k=1}^{K} \pi_k\, \mathcal{GP}\big(f \mid m_k, k_k\big),$ (15)

where the $\mathcal{GP}(f \mid m_k, k_k)$ are the local GPs and $\{\pi_k\}_{k \in \{1, \dots, K\}}$ with $\sum_{k=1}^{K} \pi_k = 1$ is the prior distribution over the mixing weights. This approach yields a GP prior which is already adapted towards the data basis when the N-Curve parameters are set accordingly.
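
Conceptually, drawing a joint sample over all modeled time steps from this mixture prior amounts to first drawing a component index according to the prior weights and then sampling from that component's joint Gaussian; a minimal illustrative sketch, under assumed naming:

```python
import numpy as np


def sample_n_gp_mixture_prior(weights, means, grams, seed=0):
    """Draw one joint sample over all modeled time steps from the mixture prior (Eq. 15).

    weights: (K,) prior mixing weights summing to one
    means:   list of K stacked mean vectors (Eq. 12)
    grams:   list of K block partitioned Gram matrices (Eq. 10)
    """
    rng = np.random.default_rng(seed)
    k = rng.choice(len(weights), p=weights)  # pick a component according to its prior weight
    return k, rng.multivariate_normal(means[k], grams[k])
```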

4 Experiments

In this section, the N-GP model is examined considering human trajectory prediction as an exemplary sequence prediction task. This task provides results that are easy to interpret and visualize, while also providing a lot of complexity, being a highly multi-modal problem despite the low data dimensionality. In human trajectory prediction, given a number of observed trajectory points as input, a sequence model is tasked with predicting the subsequent trajectory points. This section focuses on the examination of the N-GP model as a refinement component and omits a state-of-the-art comparison, as the underlying N-Curve model has been proven competitive on the given task in [19]. For this, the GP posterior distribution is calculated for different observed sequence points, which is expected to adapt the initial MDN prediction towards the actual test sample.

4.1 Parameter Estimation and Conditional Inference

For estimating the parameters of the N-GP prior distribution, an MDN is used, which maps an input vector onto the parameters of a $K$-component N-Curve mixture. In the context of sequence prediction, a common approach is to use a (recurrent) sequence encoder, such as an LSTM [17], for encoding an input sequence into a vector representation, which is then passed through the MDN. For training the MDN using a set $\mathcal{X}$ of fixed-length trajectories $\mathbf{x} = \{x_1, \dots, x_L\}$, the negative log-likelihood loss function

$\mathcal{L}(\mathcal{X}) = -\sum_{\mathbf{x} \in \mathcal{X}} \log \left( \sum_{k=1}^{K} \pi_k \prod_{j=1}^{L} \mathcal{N}\big(x_j \mid \mu_k(t_j), \Sigma_k(t_j)\big) \right)$ (16)

can be applied in conjunction with a gradient descent policy [18, 19]. Although using an MDN merely provides a point estimate of the underlying data generating distribution, this is no major drawback in the context of N-GPs, as it solely serves the purpose of estimating a GP prior distribution, which can then be adapted towards the data basis.

After training, the MDN can be used for generating ML estimates of an input trajectory segment in combination with possible future trajectories, defined in terms of an N-Curve mixture. Using Equations 11, 12 and 15, this mixture additionally provides the N-GP prior distribution. In the context of this experimental section, a subset of the input is then used for calculating the N-GP posterior in order to tailor the mean prediction generated by the MDN to a given test trajectory. As the prior is a joint Gaussian mixture distribution over all modeled time steps, it can be partitioned into the time steps to condition on and the remaining time steps. Following this, a closed-form solution exists for the posterior weights, mean vectors and covariance matrices (see e.g. [3, 31]). The probability distribution for each individual trajectory point can then be extracted through marginalization.
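
The following is a minimal sketch of this conditioning step, i.e. standard closed-form Gaussian mixture conditioning as given e.g. in [3, 31]; the index handling and all names are illustrative and do not reflect the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal


def condition_gaussian(mean, cov, obs_idx, obs_val):
    """Condition one joint Gaussian on observed entries; also return the observation likelihood."""
    obs_idx = np.asarray(obs_idx)
    hid_idx = np.setdiff1d(np.arange(len(mean)), obs_idx)
    m_o, m_h = mean[obs_idx], mean[hid_idx]
    S_oo = cov[np.ix_(obs_idx, obs_idx)]
    S_ho = cov[np.ix_(hid_idx, obs_idx)]
    S_hh = cov[np.ix_(hid_idx, hid_idx)]
    gain = S_ho @ np.linalg.inv(S_oo)
    post_mean = m_h + gain @ (obs_val - m_o)      # conditional mean
    post_cov = S_hh - gain @ S_ho.T               # conditional covariance
    likelihood = multivariate_normal(mean=m_o, cov=S_oo).pdf(obs_val)
    return post_mean, post_cov, likelihood


def condition_mixture(weights, means, covs, obs_idx, obs_val):
    """Posterior mixing weights, means and covariances for a Gaussian mixture prior."""
    parts = [condition_gaussian(m, c, obs_idx, obs_val) for m, c in zip(means, covs)]
    likelihoods = np.array([p[2] for p in parts])
    post_weights = np.asarray(weights) * likelihoods  # reweight by component evidence
    post_weights /= post_weights.sum()
    return post_weights, [p[0] for p in parts], [p[1] for p in parts]
```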

4.2 Experimental Setup

For this evaluation, scenes from commonly used datasets are considered: BIWI Walking Pedestrians ([30], scenes: ETH and Hotel), Crowds by Example ([25], scenes: Zara1 and Zara2) and the Stanford Drone Dataset ([35], scenes: Bookstore and Hyang). Following common practice, the annotation frequency of each dataset is adjusted to a common rate, and the evaluation is conducted on trajectories of a fixed length. The MDN is trained independently on each dataset to generate N-Curve mixture estimates for complete trajectories of this length. These initial predictions are then refined by calculating the posterior predictive distributions conditioned on the last observed trajectory point (posterior A) and on multiple observed trajectory points (posterior B). Using an increasing number of observations is expected to adapt the prediction further towards a given trajectory sample. For measuring the performance, the Average Displacement Error (ADE, [30, 23]) is applied according to the standard evaluation approach, using a maximum likelihood estimate. As the ADE does not provide an adequate measure for assessing the quality of (multi-modal) probabilistic predictions, the Negative (data) Log-Likelihood (NLL), being a common choice [2, 20], is used in addition to the ADE.
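
For reference, the ADE is simply the mean Euclidean distance between corresponding predicted and ground truth trajectory points; a small illustrative sketch (pixel coordinates assumed):

```python
import numpy as np


def ade(pred, gt):
    """Average Displacement Error: mean Euclidean distance between matching trajectory points."""
    return np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1).mean()
```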

                     Prior            Posterior A      Posterior B
ETH        ML-ADE    3.85 / 11.25     3.95 / 10.12     2.39 / 10.18
           NLL       6.51 / 7.58      5.43 / 7.08      -115.09 / 1.70
Hotel      ML-ADE    5.69 / 17.96     4.19 / 17.07     2.70 / 16.73
           NLL       6.99 / 8.20      5.56 / 7.71      10.63 / 8.59
Zara1      ML-ADE    4.09 / 19.10     2.89 / 17.52     1.63 / 17.64
           NLL       6.83 / 8.18      5.27 / 7.63      -51.93 / 8.15
Zara2      ML-ADE    2.98 / 21.38     2.64 / 20.07     1.69 / 20.05
           NLL       6.59 / 8.09      5.08 / 7.59      -60.95 / -1.76
Bookstore  ML-ADE    4.04 / 17.21     3.65 / 15.97     2.16 / 16.29
           NLL       7.46 / 8.37      5.88 / 7.76      -11.89 / 7.63
Hyang      ML-ADE    5.51 / 36.46     5.01 / 34.05     3.16 / 32.18
           NLL       8.21 / 9.42      6.65 / 8.86      -49.30 / 9.05

Table 1: Quantitative results of the (prior) prediction as generated by the MDN and its posterior refinements. As the predictions also cover the input sequence, each cell reports the error for the input portion (first value) and the prediction portion (second value). ADE errors are reported in pixels. Lower is better.

4.3 Results

The quantitative results with respect to the selected performance measures are depicted in Table 1. Overall, an increase in performance can be observed when refining the estimate generated by the regression-based neural network using one or multiple observed points, respectively. This is true for both the input portion and the to-be-predicted portion of the modeled trajectory. Two examples highlighting common cases for a positive effect of the refinement on the estimate are given in Fig. 3. On the one hand, the refinement can lead to the estimate being pulled closer to the ground truth in the input portion, an effect which extends far into the future prediction (first row). On the other hand, the refinement can lead to the suppression of inadequate mixture components in the posterior predictive distribution which had high weights assigned to them in the prior distribution (second row).

Figure 3: Exemplary cases of improved trajectory prediction through conditioning on one or multiple observed points, respectively. Left to right: prior, posterior A and posterior B. Condition points are indicated by a yellow square. The full ground truth trajectory is depicted in semi-transparent black.

Besides the overall performance, it can be seen that in some instances conditioning on multiple observations (posterior distribution B) degrades the performance in comparison to using a single observation (posterior distribution A). With respect to the NLL, this can be attributed to an increased number of trajectory point variances decreasing or even collapsing, whereby trajectory points closer to the observed points are affected more strongly. In this case, even minor inaccuracies in the mean prediction result in higher NLL values, even if the estimate is closer to the ground truth. Looking at the ADE, the loss in performance can most likely be attributed to the enforced interpolation of the observed points leading to unwanted deformations of the mean prediction. One of the main causes for this is that the input trajectories are commonly subject to noise. It should be noted that a common approach for dealing with such problems is to add an error term to each observed point [15], which however introduces additional hyperparameters. Examples for both of these cases are depicted in Fig. 4.

Figure 4: Common cases of the posterior distribution leading to degraded prediction performance according to the NLL (top) and ADE (bottom). Left to right: prior, posterior A and posterior B. Condition points are indicated by a yellow square. The full ground truth trajectory is depicted in semi-transparent black.

Apart from prediction refinement, the GP framework can be used in cases where new measurements appear within the predicted time horizon. These can be used to update the prediction post hoc without having to generate a new prediction using the MDN. This is especially valuable, as the MDN, like common prediction models, requires a complete trajectory as input, which would otherwise have to be extracted from the model's initial prediction and would thus, in the worst case, not be backed by any real measured data. Related to this, receiving a new measurement after several time steps without any measurement grants the opportunity to make an informed estimate when using the GP framework for closing such gaps. An example of the posterior predictive distribution given an additional observation within the prediction time horizon is depicted in Fig. 5. While there are initially multiple relevant mixture components (according to their weighting), the additional observation leads to the suppression of the wrong modes.

Figure 5: Example for updating the prediction generated by an MDN (left) using the last observed trajectory point (center) and an additional observation within the prediction time horizon (right). Condition points are indicated by a yellow square. The full ground truth trajectory is depicted in semi-transparent black.

5 Summary

Throughout this paper, a hybrid regression-based and Bayesian probabilistic sequence model has been presented. The model builds upon a special case of Gaussian processes derived from probabilistic Bézier curves generated by a Mixture Density Network. It therefore allows for Bayesian conditional inference without the need for variational approximation or Monte Carlo simulation. The viability of the model was examined in a trajectory forecasting setting, where the GP framework was applied to refine predictions generated by the regression network. Throughout these experiments, an increase in prediction performance could be observed. Further, the model allows for prediction updates within the predicted time horizon when new observations become available, without the need for another pass through the Mixture Density Network.

References

  • [1] M. A. Álvarez, L. Rosasco, and N. D. Lawrence (2012-03) Kernels for vector-valued functions: a review. Found. Trends Mach. Learn. 4 (3), pp. 195–266. External Links: ISSN 1935-8237, Link, Document Cited by: §3.2.
  • [2] A. Bhattacharyya, B. Schiele, and M. Fritz (2018) Accurate and diverse sampling of sequences based on a “best of many” sample objective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8485–8493. Cited by: §4.2.
  • [3] C. M. Bishop (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA. External Links: ISBN 0387310738 Cited by: §3.3, §4.1.
  • [4] C. M. Bishop (1994) Mixture density networks. Cited by: §1, §3.3.
  • [5] C. M. Bishop (1995) Neural networks for pattern recognition. Oxford university press. Cited by: §1.
  • [6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §1.
  • [7] S. R. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016-08) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 10–21. External Links: Link, Document Cited by: §1.
  • [8] Z. Chen, J. Fan, and K. Wang (2020) Remarks on multivariate gaussian process. arXiv preprint arXiv:2010.09830. Cited by: §3.2.
  • [9] Z. Chen, B. Wang, and A. N. Gorban (2020) Multivariate gaussian and student-t process regression for multi-output prediction. Neural Computing and Applications 32 (8), pp. 3005–3028. Cited by: §3.2.
  • [10] A. Damianou and N. Lawrence (2013) Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215. Cited by: §1.
  • [11] M. Deisenroth and J. W. Ng (2015) Distributed gaussian processes. In International Conference on Machine Learning, pp. 1481–1490. Cited by: §3.3.
  • [12] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39 (1), pp. 1–22. Cited by: §3.3.
  • [13] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §1.
  • [14] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680. Cited by: §1.
  • [15] J. Görtler, R. Kehlbeck, and O. Deussen (2019) A visual exploration of gaussian processes. Distill 4 (4), pp. e17. Cited by: §3.1, §4.3.
  • [16] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §1.
  • [17] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §4.1.
  • [18] R. Hug, W. Hübner, and M. Arens (2020) Introducing probabilistic bézier curves for n-step sequence prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 10162–10169. Cited by: §1, §2.2, §3.3, §4.1.
  • [19] R. Hug (2022) Probabilistic parametric curves for sequence modeling. Ph.D. Thesis, Karlsruher Institut für Technologie (KIT). External Links: Document Cited by: §1, §2.2, §3.3, §3, §4.1, §4.
  • [20] B. Ivanovic and M. Pavone (2019) The trajectron: probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2375–2384. Cited by: §4.2.
  • [21] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991) Adaptive mixtures of local experts. Neural computation 3 (1), pp. 79–87. Cited by: §3.3.
  • [22] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, External Links: Link Cited by: §1.
  • [23] P. Kothari, S. Kreiss, and A. Alahi (2021) Human trajectory forecasting in crowds: a deep learning perspective. IEEE Transactions on Intelligent Transportation Systems. Cited by: §4.2.
  • [24] A. Lederer, A. J. O. Conejo, K. A. Maier, W. Xiao, J. Umlauft, and S. Hirche (2021) Gaussian process-based real-time learning for safety critical applications. In International Conference on Machine Learning, pp. 6055–6064. Cited by: §3.3.
  • [25] A. Lerner, Y. Chrysanthou, and D. Lischinski (2007) Crowds by example. In Computer graphics forum, Vol. 26, pp. 655–664. Cited by: §4.2.
  • [26] G. G. Lorentz (2013) Bernstein polynomials. American Mathematical Soc.. Cited by: §2.2.
  • [27] D. J. C. MacKay (2003) Information theory, inference and learning algorithms. Cambridge University Press. Cited by: §3.
  • [28] C. L. C. Mattos, Z. Dai, A. C. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence (2016) Recurrent gaussian processes. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §1.
  • [29] M. Mirza and S. Osindero (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §1.
  • [30] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool (2009) You’ll never walk alone: modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision, pp. 261–268. Cited by: §4.2.
  • [31] K. B. Petersen and M. S. Pedersen (2008) The matrix cookbook. Technical University of Denmark 7 (15), pp. 510. Cited by: §4.1.
  • [32] H. Prautzsch, W. Boehm, and M. Paluszny (2002) Bézier and b-spline techniques. Springer Science & Business Media. Cited by: §2.2.
  • [33] C. E. Rasmussen and C. K. I. Williams (2006) Gaussian processes for machine learning. Adaptive computation and machine learning, MIT Press. External Links: ISBN 026218253X Cited by: §1, §1, §2.1, §3.
  • [34] C. Rasmussen and Z. Ghahramani (2001) Infinite mixtures of gaussian process experts. Advances in neural information processing systems 14. Cited by: §3.3.
  • [35] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In European conference on computer vision, pp. 549–565. Cited by: §4.2.
  • [36] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. Advances in neural information processing systems 28, pp. 3483–3491. Cited by: §1.
  • [37] V. Tresp (2000) Mixtures of gaussian processes. Advances in neural information processing systems 13. Cited by: §3.3.
  • [38] L. Yu, W. Zhang, J. Wang, and Y. Yu (2017) Seqgan: sequence generative adversarial nets with policy gradient. In Proceedings of the AAAI conference on artificial intelligence, Vol. 31. Cited by: §1.
  • [39] C. Yuan and C. Neubauer (2008) Variational mixture of gaussian process experts. Advances in neural information processing systems 21. Cited by: §3.3.