1 Introduction
SciSports (http://www.scisports.com/) is a Dutch sports analytics company taking a data-driven approach to football. The company conducts scouting activities for football clubs, gives advice to football players about which football club might suit them best, and quantifies the abilities of football players through various performance metrics. So far, most of these activities have been supported by either coarse event data, such as lineups and outcomes of matches, or more fine-grained event data such as completed passes, distances covered by players, yellow cards received and goals scored.
In the long term, SciSports aims to install specialized cameras and sensors across football fields to create two- and three-dimensional virtual renderings of the matches, by recording players' coordinate positions and gait data in millisecond time intervals. From this massive amount of data, SciSports is interested in predicting future game courses and extracting useful analytics. Insights gained from this learning process can be used as preliminary steps towards determining the quality and playing style of football players. In this project we based our work on a dataset containing the recorded two-dimensional positions of all players and the ball during 14 standard football matches, sampled at 0.1-second time intervals.
Football kinematics such as acceleration, maximal sprinting speed and distance covered during a match can be extracted automatically from trajectory data. However, there are also important unobservable factors/features determining the game, e.g., a player can be of enormous value to a game without being anywhere near the ball. These latent factors are key to understanding the drivers of motion and their roles in predicting future game states. There are in general two basic approaches to uncovering these factors: we can either postulate a model or structure for these factors, based on physical laws and other domain knowledge (model-based), or we can use machine learning techniques and let the algorithms discover these factors on their own (data-driven).
Model-based approaches have been widely used to analyze football trajectories. Examples in the literature include statistical models such as state space models (Yu et al., 2003a, b; Ren et al., 2008) and physical models based on equations of motion and aerodynamics (Goff and Carré, 2009). These methods have the advantage of producing interpretable results and they can quickly give reasonable predictions using relatively few past observations. In Section 3.1, we build state space models based on principles of Newtonian mechanics to illustrate these approaches.
The need to specify an explicit model is a drawback, however, since human players probably follow complicated rules of behavior. Data-driven approaches, by contrast, embody the promise of taking advantage of large amounts of data through machine learning algorithms, without specifying the model; in a sense the model is chosen by the algorithm as part of the training.
We implemented a Variational Autoencoder (VAE), as introduced by Kingma and Welling (2013), and a Generative Adversarial Net (GAN) as developed in Goodfellow et al. (2014).
The paper is organized as follows. In the next section, we describe the two-dimensional positional data used for our analyses. We present the model-based state-space approach in Section 3 and the data-driven methods based on GANs and VAEs in Sections 4.1 and 4.2, respectively. In Section 4.3 we use the Discriminator network to differentiate between the movements of different players. We conclude in Section 5 and discuss future work.
The R and Python code used to reproduce all our analyses can be found at https://bitbucket.org/AnatoliyBabic/swiscisports2018.
2 The data
The data that we used for this project was provided by SciSports and is taken from 14 complete football matches. For each player and the ball (23 entities in total) the coordinates on the field have been recorded at 10 frames per second; i.e., the trajectory of a player over a 10-second timespan corresponds to a vector of 200 coordinates (100 time points with two coordinates each). The origin of the coordinate system is the center of the pitch, and for all football fields illustrated in this report the dimensions are given in centimeters. For illustration, Figure 1 shows a single-time snapshot of the positional data for the ball and all players.
3 Methods: model-based
In this section we describe a model-based approach to extract information from the data. With this approach we have two goals: first, to extract velocities from the position data in such a way that the impact of the noise in the position measurements is minimized, and second, to estimate acceleration profiles of different players.
3.1 Newtonian mechanics and the Kalman filter
A single football player
We first consider the case of modeling the movement of one football player in the first match. We assume that this player is not a goalkeeper, since we would like to model movement ranges that span at least half the field. The data provides a player's position once every $\Delta t = 0.1$ seconds (100 milliseconds) as long as he remains in the game. Let us denote a player's position in the plane at timestep $t$ as $x_t$, with the corresponding velocity and acceleration as $v_t$ and $a_t$; they are related by $v = \dot{x}$ and $a = \dot{v}$. By approximating these derivatives by finite differences we obtain

(3.1)  $x_{t+1} = x_t + \Delta t\, v_t, \qquad v_{t+1} = v_t + \Delta t\, a_t.$
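As a minimal illustration, the finite differences above can be computed directly from sampled positions. This is a numpy sketch with an artificial toy trajectory, not the SciSports data:

```python
import numpy as np

dt = 0.1  # sampling interval in seconds (10 frames per second)

def finite_difference_kinematics(x, dt):
    """Approximate velocity and acceleration from sampled 2-D positions.

    x: array of shape (n, 2) holding positions at timesteps 0..n-1.
    Returns (v, a) of shapes (n-1, 2) and (n-2, 2), per equation (3.1).
    """
    v = np.diff(x, axis=0) / dt  # v_t ~ (x_{t+1} - x_t) / dt
    a = np.diff(v, axis=0) / dt  # a_t ~ (v_{t+1} - v_t) / dt
    return v, a

# Toy trajectory: constant acceleration of 1 m/s^2 in the x-direction.
t = np.arange(20) * dt
x = np.stack([0.5 * t**2, np.zeros_like(t)], axis=1)
v, a = finite_difference_kinematics(x, dt)
```

On noise-free data this recovers the constant acceleration exactly; on real position measurements the same differencing amplifies noise, which is the motivation for the Kalman filter below.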
We now model the acceleration $a_t$. We assume that at each timestep the acceleration is independently and normally distributed with mean $0$ and unknown covariance matrix $Q$ (we write this as $a_t \sim N(0, Q)$). Since acceleration is proportional to force by Newton's second law of motion, this induces a normal distribution on the corresponding force exerted by the player, and the exponential decay of its tails translates to natural limits imposed on muscular work output. In view of (3.1), we take position and velocity as our underlying state vector $\alpha_t = (x_t, v_t)$, and we consider the following model:
(3.2)  $\alpha_{t+1} = T\alpha_t + Ra_t, \qquad T = \begin{pmatrix} I & \Delta t\, I \\ 0 & I \end{pmatrix}, \quad R = \begin{pmatrix} 0 \\ \Delta t\, I \end{pmatrix},$

(3.3)  $y_t = Z\alpha_t + \varepsilon_t, \qquad Z = \begin{pmatrix} I & 0 \end{pmatrix}.$
In the state equation (3.2), the state vector $\alpha_t$ propagates forward in time according to the Newtonian dynamics of (3.1), driven by an acceleration $a_t \sim N(0, Q)$. In the observation equation (3.3), the observed quantity $y_t$ records the player's position and not his/her velocity, and we assume that these position data are recorded with Gaussian measurement errors: $\varepsilon_t \sim N(0, \Sigma_\varepsilon)$. We initialize the state $\alpha_1$ randomly, and we assume that $\alpha_1$, the $a_t$ and the $\varepsilon_t$ are mutually independent, and independent across different times.
We use a Kalman filter to integrate this model with the measurements; this should lead to an estimate for the velocity that is less noisy than simply calculating finite differences. However, the Kalman filter parameters depend on the noise levels as characterized by the player's acceleration covariance $Q$ and the measurement error covariance $\Sigma_\varepsilon$, and these we do not know; therefore we combine the Kalman filter with parameter estimation. In each Kalman-filter timestep we assume that we have access to the observations up to time $t$, and we compute the one-step state prediction $\hat\alpha_{t+1}$ and the prediction error $e_t = y_t - Z\hat\alpha_t$, in conjunction with their estimated covariance matrices $P_{t+1}$ and $F_t = ZP_tZ^\top + \Sigma_\varepsilon$. The Kalman recursion formulas for these calculations are given by (see Appendix A of Helske, 2017)

(3.4a)  $\hat\alpha_{t+1} = T\big(\hat\alpha_t + K_t e_t\big),$

(3.4b)  $P_{t+1} = T\big(P_t - K_t F_t K_t^\top\big)T^\top + RQR^\top,$

where $K_t = P_t Z^\top F_t^{-1}$. For given values of $Q$ and $\Sigma_\varepsilon$ this leads to time courses of the state prediction $\hat\alpha_t$, its covariance $P_t$, and the derived quantities $e_t$ and $F_t$.
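The recursion (3.4) can be sketched in a few lines of numpy. The matrices below follow the state space model above; the specific values of $Q$, $\Sigma_\varepsilon$ and the observation are illustrative only:

```python
import numpy as np

dt = 0.1  # sampling interval

# State alpha_t = (x, y, vx, vy); observation y_t = (x, y).
T = np.block([[np.eye(2), dt * np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])        # state transition
Z = np.hstack([np.eye(2), np.zeros((2, 2))])         # observation matrix
R = np.vstack([np.zeros((2, 2)), dt * np.eye(2)])    # acceleration loading

def kalman_step(a, P, y, Q, Sigma_eps):
    """One step of the Kalman recursion (3.4): returns the one-step state
    prediction, its covariance, the prediction error and its covariance."""
    e = y - Z @ a                       # prediction error
    F = Z @ P @ Z.T + Sigma_eps         # prediction error covariance
    K = P @ Z.T @ np.linalg.inv(F)      # Kalman gain
    a_next = T @ (a + K @ e)            # (3.4a)
    P_next = T @ (P - K @ F @ K.T) @ T.T + R @ Q @ R.T  # (3.4b)
    return a_next, P_next, e, F

a, P = np.zeros(4), np.eye(4)
a, P, e, F = kalman_step(a, P, np.array([1.0, 0.0]),
                         Q=0.01 * np.eye(2), Sigma_eps=1e-4 * np.eye(2))
```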
We have a total of 6 unknown parameters in our state space model, i.e., the two diagonal entries of $\Sigma_\varepsilon$ and all four entries of $Q$ (we did not exploit the symmetry of $Q$). Given the result of a calculation for given $Q$ and $\Sigma_\varepsilon$, the log-likelihood function (Helske, 2017) is given by
(3.5)  $\ell(Q, \Sigma_\varepsilon) = -\frac{np}{2}\log 2\pi - \frac{1}{2}\sum_{t=1}^{n}\Big(\log\det F_t + e_t^\top F_t^{-1} e_t\Big),$

where $p$ is the dimension of $y_t$ at a fixed $t$, which in our present case is 2, and $n$ is the number of timesteps used. We then compute the maximum likelihood estimator for the covariance parameters using the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimization algorithm.
This setup leads to the following multilevel iteration.

1. We select the first 10 timesteps from the data; this means that we know the values of $y_1$ to $y_{10}$.

2. At the outer level we maximize the log-likelihood function (3.5) with respect to $Q$ and $\Sigma_\varepsilon$.

3. At the inner level, i.e. for each evaluation of the log-likelihood, we run the Kalman filter (3.4) for 10 steps, ending at time $t = 10$.

4. After completing the optimization over $Q$ and $\Sigma_\varepsilon$ for this choice of 10 timesteps, we have both estimates of $Q$ and $\Sigma_\varepsilon$ for that period and a one-step-ahead prediction of the state. We then shift the 10-step window by one timestep, to $y_2, \dots, y_{11}$, and go back to step 2.

At the end of this process, we have for each 10-step window of times a series of estimates of $Q$, $\Sigma_\varepsilon$, $\hat\alpha_t$, $P_t$, and the one-step-ahead predictions.
Remark 1.
Each of the 11-step runs of the Kalman filter equations (3.4) needs to be initialized. We initialize the state randomly, as mentioned above. Concerning the choice of the initial covariance $P_1$, a commonly used default is to set a diffuse prior distribution with very large variances. However, this is numerically unstable and prone to cumulative round-off errors. Instead, we use the exact diffuse initialization method, by decomposing $P_1$ into its diffusive and non-diffusive parts; for more details see Koopman and Durbin (2003).
Remark 2.
In the actual implementation, some technical modifications are needed to speed up computations, particularly when $y_t$ consists of high-dimensional observations at each time point (which happens when we estimate all 23 entities, as we do below). To address this dimensionality issue and to avoid direct inversion of $F_t$, the state space model of (3.2) and (3.3) is recast into an equivalent univariate form and the latent states are estimated using a univariate Kalman filter (cf. Koopman and Durbin, 2000).
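The idea of the univariate treatment can be sketched as follows: with independent measurement errors, the scalar elements of the observation vector can be processed one at a time, so every quantity that needs inverting is a scalar. This is an illustrative numpy sketch in the spirit of Koopman and Durbin (2000), not the KFAS implementation itself:

```python
import numpy as np

def univariate_update(a, P, y, Z, sigma2):
    """Sequential (univariate) Kalman measurement update: process the
    scalar elements of y one by one, so no matrix inversion is needed.
    Assumes independent measurement errors with variances sigma2."""
    for i in range(len(y)):
        z = Z[i]                   # i-th row of the observation matrix
        e = y[i] - z @ a           # scalar prediction error
        F = z @ P @ z + sigma2[i]  # scalar prediction error variance
        K = P @ z / F              # gain vector
        a = a + K * e
        P = P - np.outer(K, K) * F
    return a, P
```

For a diagonal measurement error covariance this produces exactly the same updated state and covariance as the standard multivariate update.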
The Kalman filter algorithm and parameter estimation (including the univariate formulation and diffuse initialization) were performed using the KFAS package (see Helske, 2017) in R.
Results for a single player
We modeled the movement of the player with number 3, who appears to hold the position of left central midfielder, and who was on the pitch for the entire game. As described above, we use a sliding window of 10 training samples for predictions: we first use time points 1 to 10 to predict the 11th point (one-step-ahead), then we shift the window one timestep ahead and use time points 2 to 11 to predict the 12th point, and so on.
Figure 2 shows the one-step-ahead predicted positions of our midfielder (blue dots) for the first 2500 time points. We see that the state space model is able to make accurate predictions (when compared to the red true positions), even though we have used only the 10 past locations in our algorithm. Moreover, the model is able to trace out complicated movements and sharp corners, as is evident from the figure.
As mentioned above, one reason for applying a Kalman filter to the data is to extract the velocity. Figure 3 illustrates the velocity vectors as arrows tangent to the position curve. We also plot the scalar speeds against the 2500 time points in Figure 4.
To see the correspondence between these three figures, let us focus on a distinguishing stretch of movement made by our midfielder, who sprints towards the goal post in the East, makes two loops towards the North, and then moves back across the field to the West, tracing a somewhat elongated rectangle on the field. We know that he is sprinting to the goal from Figure 3, due to the long arrows pointing to the East, whose magnitudes correspond to a peak in Figure 4. The midfielder has relatively lower speeds when making the double loop, and then picks up momentum when moving towards the West, as is evident from the subsequent marked increase in speeds in Figure 4.
Figure 5 shows the predictive performance of this model for longer time horizons; in this case we are using the same 10-point windows to predict multiple steps ahead. When compared with the one-step-ahead case of Figure 2, we see that there is some deterioration in the model's predictive capability, particularly in places where the player's trajectory is curved. From this plot, we can deduce that positional uncertainties are greatest when the midfielder is moving in loops or circles.
Results for the ball and all football players
Let us now consider the general case of modeling all football players, including goalkeepers, and the ball (collectively called 'entities'). A snapshot of the positional data partway into the game is shown in Figure 1. We choose the same equations for all entities, giving for all $i = 1, \dots, 23$,

(3.6)  $x^{(i)}_{t+1} = x^{(i)}_t + \Delta t\, v^{(i)}_t, \qquad v^{(i)}_{t+1} = v^{(i)}_t + \Delta t\, a^{(i)}_t.$
By stacking 23 copies of the single-player model (3.2) and (3.3), we convert the equations of motion above to the corresponding joint state space model, with measurement vector $y_t = \big(y^{(1)}_t, \dots, y^{(23)}_t\big)$. Here the measurement error vector is $\varepsilon_t = \big(\varepsilon^{(1)}_t, \dots, \varepsilon^{(23)}_t\big)$ with $\varepsilon_t \sim N(0, \Sigma_\varepsilon)$, and the acceleration vector is $a_t = \big(a^{(1)}_t, \dots, a^{(23)}_t\big)$ with $a_t \sim N(0, Q)$.
It would be interesting to use this framework to model the interactions between different football players and the ball through the covariance matrix $Q$; obviously, in a real match one expects a strong correlation between all entities. However, an unstructured $Q$ is a $46 \times 46$ matrix consisting of 2116 parameters, and adding the diagonal elements of $\Sigma_\varepsilon$ yields a total of 2162 parameters. We found that this general case takes a prohibitively long time to optimize, and we have to simplify the problem by imposing additional structure on $Q$. To keep computations manageable, we disregard correlations between entities, by assuming that $Q$ is a block diagonal matrix $Q = \operatorname{diag}\big(Q^{(1)}, \dots, Q^{(23)}\big)$, where each $Q^{(i)}$ is a $2 \times 2$ covariance matrix. In other words, each player's movement is modeled using his/her own state space equations, independent of the other players.
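The simplified block-diagonal structure of $Q$ can be assembled with scipy; the per-entity covariance blocks below are randomly generated placeholders, not estimated values:

```python
import numpy as np
from scipy.linalg import block_diag

n_entities = 23  # 22 players plus the ball

# One 2x2 acceleration covariance per entity (illustrative values only).
rng = np.random.default_rng(1)
blocks = []
for _ in range(n_entities):
    A = rng.normal(size=(2, 2))
    blocks.append(A @ A.T + 0.1 * np.eye(2))  # symmetric positive definite

# 46 x 46 block diagonal Q: zero covariance between different entities.
Q = block_diag(*blocks)
```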
If the prediction horizon is short, e.g., one step ahead, we found that this choice of $Q$ gives reasonable predictive performance, as shown in Figure 6. Here we have used the past 10 time points to predict one timestep ahead, and we see that the one-step-ahead predicted player's position (blue) closely follows the truth (red) over the plotted time span. Moreover, the path of the ball is instantly recognizable as the zigzag dotted line (due to it being the fastest object) embedded in the network of trajectories. If longer prediction horizons are sought, then this simplifying assumption might not give good performance, and cross-covariance terms between players and ball are needed. To that end, one can consider low-rank approximations or imposing sparsity constraints on $Q$. Alternatively, we can turn to machine-learning methods, training a (deep) multilevel neural network to learn these complex interactions; this is the subject of the next section.
4 Methods: data-driven
In this section we describe machine-learning techniques to model the spatiotemporal trajectories of players and the ball throughout the game, in order to acquire meaningful insight into football kinematics. Our philosophy is that we aim to construct networks that can generate trajectories that are statistically indistinguishable from the actual data. Successfully trained networks of this type have a number of benefits. They allow one to quickly generate more data; the components of such networks can be reused (we show an example in Section 4.3); when they produce 'latent spaces', these latent spaces may be interpreted by humans; and the structure of successful networks and the values of the trained parameters should, in theory, give information about the trajectories themselves.
In Section 4.1, we use Generative Adversarial Networks, in which two networks are pitted against each other to generate trajectories. Next, in Section 4.2, we consider another class of networks called Variational Autoencoders, which compress the data and are trained to replicate trajectories by learning important features. Finally, in Section 4.3 we investigate a method to discriminate between the walking patterns of two different football players.
4.1 Generative Adversarial Network
Generative Adversarial Networks (GANs) are deep neural net architectures introduced by Goodfellow et al. (2014) which exploit the competition between two (adversarial) networks: a generative network called the Generator and a discriminative network called the Discriminator.
Both the Generator and Discriminator are trained with a training set of real observations, and against each other. The Discriminator is a classifier; it has to learn to differentiate between real and generated observations, labeling them as “realistic” and “fake” respectively. The Generator, on the other hand, has to learn to reproduce features of the real data and generate new observations which are good enough to fool the Discriminator into labeling them as “realistic”.
2D positional data into images
GANs have been used with great success in image recognition, 3D-model reconstruction and photorealistic imaging; see e.g. Karazeev (2017). Because of the limited time available to us, we decided to capitalize on existing code for images; we use Bruner and Deshpande (2017). By rescaling the data accordingly, we map the football field to a square and interpret a 10-second trajectory as a $100 \times 2$ grayscale image: for each of the 100 time points, the two degrees of freedom indicate the rescaled $x$- and $y$-positions. This "image" is what we input into the neural network machinery.

Network setup
The algorithm we use is a repurposed version of the basic convolutional neural network found at Bruner and Deshpande (2017), which is meant to recognize and reproduce handwritten digits. There is a structural difference between the two:

- the original algorithm works with the MNIST digit dataset, which consists of black-and-white images of 10 possible states (the digits 0 to 9);

- our algorithm works with grayscale images, each containing an aggregation of 10 seconds of play.

If we were to convert our grayscale images to black-and-white, we would lose too much information.
Another important difference is in the intrinsic asymmetry of the data:

- in the original version, both the Discriminator and the Generator look at two-dimensional spatial features of the images: useful information about the topology of the shape can be obtained by looking at spatial neighborhoods of any given pixel;

- in our case we want to look at the $x$- and $y$-coordinates independently, therefore our Discriminator and Generator work with one-dimensional temporal features: the information regarding the $x$- or $y$-trajectory in a temporal neighborhood of each position, i.e., its recent past and future. The temporal extent of a feature should not be too small, otherwise the feature only observes the position of a player; on the other hand, if the feature is too large, it observes almost the entire 10-second trajectory, and the trajectory only contains a few features. To balance this tradeoff we use temporal features of intermediate length.
By making this tweak to the original algorithm, we exploit the natural directionality of the data and we avoid conflating the spatial properties (i.e., the shade of gray) and the temporal properties (i.e., the variation in shade). To get a sense of what this means, we visualize the correspondence between the coordinates and the real trajectory of a player in Figure 7.
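The conversion from positional data to a network input can be sketched as follows. The pitch half-sizes used for rescaling are assumptions made for the sake of the example; only the $100 \times 2$ shape and the rescaling to $[0, 1]$ matter here:

```python
import numpy as np

# Illustrative half-sizes of the pitch in cm; the exact field dimensions
# here are an assumption, not taken from the SciSports specification.
HALF_X, HALF_Y = 5250.0, 3400.0

def trajectory_to_image(traj):
    """Rescale a 10-second trajectory (100 positions in cm, origin at the
    center of the pitch) to a 100 x 2 'grayscale image' in [0, 1]."""
    img = np.empty_like(traj, dtype=float)
    img[:, 0] = (traj[:, 0] + HALF_X) / (2 * HALF_X)
    img[:, 1] = (traj[:, 1] + HALF_Y) / (2 * HALF_Y)
    return np.clip(img, 0.0, 1.0)

# Toy trajectory: a straight run along the long axis of the field.
traj = np.stack([np.linspace(-HALF_X, HALF_X, 100), np.zeros(100)], axis=1)
img = trajectory_to_image(traj)
```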
The algorithm
We limit our training set to random samplings of 20-second trajectories of any single player (excluding goalkeepers and the ball) during a single fixed match. This should give some extra structure for the network to work with, while maintaining a diverse enough data sample.
The initialization of the parameters is the same as in the original algorithm: the Generator takes a standard Gaussian noise vector as input and then produces a new image based on the updates made by the network. For a glimpse of what an untrained Generator is capable of, see Figure 8.
The Discriminator is then pre-trained with real and generated trajectories. After this first training epoch, the Discriminator is able to correctly discriminate between the real trajectories and the untrained noisy ones produced by the Generator. Here an epoch consists of one full learning cycle over the training set. Then the main training session begins. From the second epoch onward, the Discriminator is trained with real and generated data, and the Generator itself is trained against the Discriminator. This produces a Generator-Discriminator feedback loop that forces both networks to improve themselves, each with the objective of outperforming the other. This is achieved by implementing loss functions that measure three quantities:

Discriminator loss vs. real: measures how far the Discriminator is from labeling a real trajectory as "realistic";

Discriminator loss vs. Generator: measures how far the Discriminator is from labeling a generated image as "fake";

Generator loss vs. Discriminator: measures how far the Discriminator is from labeling a generated image as "realistic".
The first loss function deals with the interaction between the Discriminator and the real world: it makes sure that the network keeps adapting to recognize new real observations. The second and third loss functions, on the other hand, work against each other: one tries to force the Discriminator to always label a generated image as "fake", while the other forces the Generator to produce data that mimics the Discriminator's perception of the real world. The loss function used throughout the algorithm is the cross-entropy loss; for a discussion see Seita (2017).
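The three quantities can be sketched with the cross-entropy loss applied to the Discriminator's output probabilities. This is plain numpy; the numeric outputs are illustrative stand-ins for real network outputs:

```python
import numpy as np

def cross_entropy(p, label):
    """Binary cross-entropy between predicted probability p and a target label."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # numerical safety
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Discriminator outputs: probability that an input is "realistic".
d_real = 0.9  # output on a real trajectory
d_fake = 0.2  # output on a Generator sample

loss_d_real = cross_entropy(d_real, 1.0)  # Discriminator vs. real
loss_d_fake = cross_entropy(d_fake, 0.0)  # Discriminator vs. Generator
loss_g = cross_entropy(d_fake, 1.0)       # Generator vs. Discriminator
```

Note that the second and third losses pull the same output in opposite directions, which is precisely the feedback loop described above.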
Performance and limitations
Properly training a GAN requires a long time, and much can go wrong in the process. The Generator and Discriminator need to maintain a careful balance, otherwise one will outperform the other, causing either the Discriminator to blindly reject any generated image, or the Generator to exploit blind spots the Discriminator may have. After a training session of several hours, our GAN managed to go from random noise trajectories to smooth and structured ones, although it did not fully learn the underlying structure of the data. While the generated movements look impressive when compared to the untrained ones, they still underperform when confronted with the real world. First and foremost, the acceleration patterns of the players make no physical sense: the algorithm is not able to filter out small local noise, and the trajectories are not smooth enough. The evolution of the network during training is shown in Figure 9. In the end, the GAN is not consistent enough when asked to generate large samples of data: too many trajectories do not look realistic.
4.2 Variational Autoencoder
In parallel, we implemented a Variational Autoencoder (VAE) as introduced by Kingma and Welling (2013). Like a GAN, a VAE is an unsupervised machinelearning algorithm that gives rise to a generative model.
We apply the VAE algorithm to normalized trajectory data spanning 50 seconds. We call the set of all such trajectory data $\mathcal{X}$. As the trajectories are sampled at intervals of 0.1 seconds, this means that we can identify $\mathcal{X}$ with $\mathbb{R}^{1000}$ (500 time points with two coordinates each).

A VAE consists of two neural networks, an encoder and a decoder. The encoder is a function $E_\phi$ (parametrized by a vector $\phi$) that maps from the product of the space of input data $\mathcal{X}$ and a space of noise variables $\mathcal{E}$ to the so-called latent space $\mathcal{Z}$. We identify the space $\mathcal{Z}$ with $\mathbb{R}^k$ for some latent dimension $k$. The decoder is a function $D_\theta$ (parametrized by a vector $\theta$) which maps from the latent space and a second space of noise variables $\mathcal{F}$ back to the data space $\mathcal{X}$.

We choose the spaces of noise variables $\mathcal{E}$ and $\mathcal{F}$ to be Euclidean, with the same dimensions as $\mathcal{Z}$ and $\mathcal{X}$ respectively, and endow them with standard Gaussian measures.
The encoder and decoder have a special structure. We implemented (as neural networks) functions $\mu_\phi \colon \mathcal{X} \to \mathcal{Z}$ and $\sigma_\phi \colon \mathcal{X} \to \mathcal{Z}$ and chose

$E_\phi(x, \epsilon) = \mu_\phi(x) + \operatorname{diag}(\sigma_\phi(x))\,\epsilon.$

Here, $\operatorname{diag}(\sigma_\phi(x))$ is a diagonal matrix with $\sigma_\phi(x)$ on the diagonal. Equivalently, $\operatorname{diag}(\sigma_\phi(x))\,\epsilon$ is just the elementwise product of $\sigma_\phi(x)$ and $\epsilon$.
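In code, the encoder's output for a given input is simply the following (a numpy sketch; the values standing in for $\mu_\phi(x)$ and $\sigma_\phi(x)$ would normally come from the trained network and are hard-coded here for illustration):

```python
import numpy as np

def encode(mu, sigma, eps):
    """E_phi(x, eps) = mu(x) + diag(sigma(x)) eps: multiplying by the
    diagonal matrix equals an elementwise product with sigma(x)."""
    return mu + sigma * eps

mu = np.array([0.5, -1.0])    # stand-in for mu_phi(x)
sigma = np.array([0.1, 0.2])  # stand-in for sigma_phi(x)
eps = np.random.default_rng(0).standard_normal(2)  # eps ~ N(0, I)
z = encode(mu, sigma, eps)    # a sample from N(mu, diag(sigma^2))
```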
Similarly, we implemented a function $\Psi_\theta \colon \mathcal{Z} \to \mathcal{X}$, selected a constant $c > 0$, and chose

$D_\theta(z, \zeta) = \Psi_\theta(z) + \sqrt{c}\,\zeta.$
The decoder provides us with a generative model for the data: to generate a data point, we first sample $z$ and $\zeta$ independently according to standard normal distributions, after which we apply the decoder to the pair $(z, \zeta)$. Alternatively, we can generate zero-noise samples by only sampling $z$ and computing $\Psi_\theta(z)$.
The Variational Autoencoder is the composition of the encoder and the decoder, in the sense that

$\mathrm{VAE}_{\phi,\theta}(x, \epsilon, \zeta) = D_\theta\big(E_\phi(x, \epsilon), \zeta\big).$

The parameters $\phi$ and $\theta$ of the VAE are optimized simultaneously, so that when we apply the VAE to a randomly selected triple of a trajectory $x$, a noise variable $\epsilon$ and a noise variable $\zeta$, the result is close to the original trajectory $x$, at least on average.
To this end, we follow Kingma and Welling (2013) and minimize an average loss, for the loss function given by

(4.1)  $L(x, \epsilon) = \frac{1}{2c}\,\big\| x - \Psi_\theta\big(E_\phi(x, \epsilon)\big)\big\|^2 + \frac{1}{2}\sum_{j=1}^{k}\Big(\mu_{\phi,j}(x)^2 + \sigma_{\phi,j}(x)^2 - \log \sigma_{\phi,j}(x)^2 - 1\Big).$
For a derivation of this loss function, we refer the reader to the Appendix.
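A per-sample numpy sketch of the standard VAE objective: a reconstruction term weighted by the decoder noise level $c$, plus the Kullback-Leibler divergence of $N(\mu, \operatorname{diag}(\sigma^2))$ from the standard normal prior. The input values are illustrative:

```python
import numpy as np

def vae_loss(x, x_hat, mu, sigma, c):
    """Reconstruction error weighted by the decoder noise level c, plus
    the KL divergence of N(mu, diag(sigma^2)) from the standard prior."""
    recon = np.sum((x - x_hat) ** 2) / (2 * c)
    kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)
    return recon + kl

x = np.array([0.2, 0.8])        # a (tiny) stand-in "trajectory"
x_hat = np.array([0.25, 0.75])  # its reconstruction by the VAE
mu = np.zeros(2)
sigma = np.ones(2)              # mu = 0, sigma = 1: the KL term vanishes
loss = vae_loss(x, x_hat, mu, sigma, c=1.0)
```

A small $c$ makes the reconstruction term dominate the prior-matching term, which is relevant for the effect of retraining with reduced $c$ discussed below.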
We implemented the Autoencoder in the Keras library for Python (Chollet et al., 2015). The library comes with an example VAE which we took as a starting point. We introduced a hidden layer in the encoder and in the decoder, and implemented the functions $\mu_\phi$ and $\sigma_\phi$ as

$\mu_\phi = f_\mu \circ g, \qquad \sigma_\phi = \exp \circ f_\sigma \circ g,$

where $g$ is the composition of an affine map and ReLU activation functions, the functions $f_\mu$ and $f_\sigma$ are linear, and $\exp$ is the exponential function applied componentwise. Similarly, $\Psi_\theta = f_\Psi \circ h$, where the function $h$ is again a composition of an affine map and ReLU activation functions, and the function $f_\Psi$ is a composition of an affine map and sigmoid activation functions.
We trained the model, i.e. we adjusted the parameters $\phi$ and $\theta$ to minimize the average loss, using the 'rmsprop' optimizer in its default settings. Whether the model trained successfully or not seemed to depend crucially on the versions of the libraries used. For the results presented below, we used Keras version 2.1.3 on top of Theano version 1.0.1. We first trained the model for a fixed, relatively large value of the constant $c$; after training for 1000 epochs, the average loss had stabilized. We then used the VAE to approximate trajectories: we sampled trajectories at random from the data and compared them to their approximations.
The average absolute deviation per coordinate per timestep (expressed as a ratio with respect to the dimensions of the playing field) was small, as were the average squared error per coordinate per timestep and the average maximum error per coordinate taken over the whole trajectory.
In Figure 10 we show the result of sampling four random trajectories from the data, and comparing them to their approximation by the VAE. The approximating trajectories are much smoother than the original ones. Some qualitative features of the original paths, such as turns and loops, are also present in the approximating paths. Even though the average error in the distance per coordinate per time step is relatively small, visually there is still quite some deviation between the true and the approximating trajectories. We expect, however, that with a more extensive network, consisting of more convolutional layers, we can greatly improve the approximation.
Next, we use the decoder of the VAE as a generative model. In particular, we sample trajectories at random by first sampling $z$ according to a standard normal distribution, and then computing the trajectory $\Psi_\theta(z)$. A collection of six trajectories generated in this way is shown in Figure 11. At first sight, the generated trajectories look like they could have been real trajectories of football players. However, they are in general smoother than the real trajectories. We could also have generated trajectories by sampling both $z$ and $\zeta$ according to standard normal distributions and computing $D_\theta(z, \zeta)$. However, those trajectories would have been much too noisy.
If we reduce the value of $c$ and retrain the model, the approximation of the trajectories becomes slightly better, and the final average loss reduces to 0.67 after training for 600 epochs. The corresponding plots look similar to Figure 10. However, if we now use the decoder to generate trajectories, most of the trajectories end up close to the boundary of the playing field: the dynamics of the generated trajectories are then clearly very different from the original dynamics.
In Appendix A, we explain this effect by investigating the different parts of the loss function given in (4.1). The upshot is that when $c$ is very small, the proportion of latent variables that are in the range of the encoder is very small (measured with the Gaussian measure on $\mathcal{Z}$). If one applies the decoder to a latent variable $z$ in the range of the encoder, one probably gets a realistic trajectory. But for latent variables not in the range of the encoder, there is no reason for the decoded trajectories to look realistic at all.
4.3 Discriminator
In the previous sections, we studied several methods to create generative models for the movement trajectories of football players, with the aim of capturing the underlying dynamics and statistics. In this section, we study to what extent the movement trajectories of different players can be distinguished. To this end, we test the Discriminator network of the GAN introduced in Section 4.1 on data from different players. We train the Discriminator on the data of two players, and then test whether the Discriminator is able to distinguish their motion patterns. The success rate of the Discriminator in distinguishing one player from the other then gives some insight into how different the movement behaviors of the two players are.
The loss function for the Discriminator is the same as in Section 4.1. The data we use as input for the Discriminator are the coordinates of 10-second player trajectories. We test the Discriminator on these unedited trajectories, and on centered trajectories, where the coordinates of each trajectory are translated such that the first coordinate always equals the origin. Thus, when the uncentered data are used, the Discriminator may distinguish two players by using their positions on the field, whereas the Discriminator can only use the movement patterns of particular players when the centered data are used.
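Centering a trajectory is a one-line preprocessing step (a numpy sketch with illustrative coordinates):

```python
import numpy as np

def center_trajectory(traj):
    """Translate a trajectory so that its first coordinate is the origin;
    the Discriminator then sees only the movement pattern, not where on
    the field the movement happened."""
    return traj - traj[0]

traj = np.array([[100.0, 50.0],   # positions in cm
                 [110.0, 55.0],
                 [130.0, 50.0]])
centered = center_trajectory(traj)
```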
Figure 12 shows the Discriminator loss function for both players as a function of the number of training steps for two different sets of two players. We see that the loss function declines more for the uncentered data than for the centered data. Thus, the Discriminator distinguishes uncentered trajectories based on the location on the field where the movement pattern happens. The two different examples also show that it is easier to distinguish some players than others.
Table 1 shows the success rate of correctly identifying the player corresponding to a given trajectory after the training period for the two sets of players of Figure 12. The success rate of the Discriminator using the uncentered data is higher than for the centered data in both examples. Using the centered data, the Discriminator has difficulties distinguishing between players 1 and 2 in the first example. In the second example, the success rate is much higher. Thus, some players display more similarities in their movement patterns than other players.
                           Player 1   Player 2   Player 3   Player 4
example 1   non-centered     0.74       0.90
            centered         0.20       0.96
example 2   non-centered                           0.98       0.82
            centered                               0.54       0.95
5 Conclusion and future work
We used several methods to learn the spatiotemporal structure of trajectories of football players. With the state-space modeling approach we extracted velocity information from the trajectory data, and learned basic statistics of the motion of individual players. With deep generative models, in particular Variational Autoencoders, we captured the approximate statistics of trajectories by encoding them into a lower-dimensional latent space. Due to limitations on time and computational power, we did not manage to successfully train Generative Adversarial Nets on the data. Nonetheless, we were able to use the Discriminator network to distinguish between different football players based on their trajectory data. The algorithm was more successful when we used non-centered rather than centered data, and was better at distinguishing between some players than others.
It is very likely that with deeper convolutional neural networks, we can train VAEs that approximate the statistics of the player trajectories even better. Besides, the approach can easily be extended to approximate trajectories of multiple players and the ball, although we may need more data to get an accurate model.
A big challenge is to interpret the latent space of the VAE. Ideally, one would be able to recognize qualities of the players as variables in the latent space. Although this is a difficult task in general, we expect that by adding additional structure to the architecture of the VAE, we can at least extract some relevant performance variables per player and recognize differences between players. Moreover, state-space models could be unified with VAEs to increase the interpretability of the latent variables.
By continuing this line of work, we could conceivably find an appropriate state space such that the football game can be fitted into a Reinforcement Learning framework. This framework may then be used to find optimal strategies, and to extract individual qualities of football players.
Appendix A Derivation of the loss function of the VAE
In this appendix we derive the loss function of the Variational Autoencoder. It is the same loss function as the one used by Kingma and Welling (2013), and more generally corresponds to the usual loss function in variational inference, but our presentation here is slightly non-standard and is based on general measure-theoretic probability.
Before we can discuss the loss function and its meaning, we need to introduce notation for the various measures encountered in the problem. Both the encoder and the decoder of the VAE induce measures on the product of the data space and the latent space, and the optimization procedure aims to bring these measures as close as possible to each other. We first describe the encoder and decoder measures.
Encoder measure
Recall from Section 4.2 that we can identify and with . In addition, we let and be subsets of and we set and in our own implementation. Let us start by assuming that trajectories are obtained by sampling independently according to a distribution , which we assume to be absolutely continuous with respect to the fold product of Lebesgue measures on with density . We denote the standard Gaussian measure on by . The encoder induces a measure on the space by
where is the identity map, and is the pushforward measure of induced by measurable function such that for any measurable set . Equivalently, for every bounded and continuous function it holds that
We observe that and are indeed the marginals of the measure , and similarly we will denote by the marginal of , etc. We will occasionally refer to as the encoder measure or the recognition model.
Finally, we denote the conditional distribution on induced from the encoder given by
We assume that its density with respect to , the standard Gaussian measure on , exists and we denote it by . The measure is then absolutely continuous with respect to with density
Decoder measure
Analogously, we denote by and the standard Gaussian measures on and respectively. The decoder induces a measure on the space , given by
Again, we observe that and are the marginals of and we denote by
the marginal probability distribution on
. We refer to as the decoder measure or the generative model. We will assume that is absolutely continuous with respect to the product measure , and that its density is strictly positive. Since is the marginal of , it follows that the marginal density is defined by and for every . We also define the conditional density
We denote the corresponding conditional probability distribution on
by and note that it coincides with the law of the decoder conditioned on , (A.1) 
In the particular context of the Variational Autoencoder explained in Section 4.2, we find that
Similarly, we define the marginal density by
Note that for all . We denote the associated probability distribution on by . We set
and denote by the associated conditional probability distribution that has density with respect to .
Note that by definition, the following version of Bayes’ Theorem holds
(A.2) 
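Written in the density notation of Kingma and Welling (2013) — used here purely as an illustration, since the measure-theoretic statement above is more general — this version of Bayes' Theorem takes the familiar form:

```latex
p_\theta(z \mid x) \;=\; \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)},
\qquad
p_\theta(x) \;=\; \int p_\theta(x \mid z)\, p(z)\, \mathrm{d}z .
```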
Derivation of loss function
The loss function of the Variational Autoencoder is built around the relative entropy, more commonly known as the Kullback–Leibler (KL) divergence. If $P$ and $Q$ are probability measures on a measure space $(\Omega, \mathcal{F})$, the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \,\|\, Q)$ is defined to be $+\infty$ if $P$ is not absolutely continuous with respect to $Q$, and otherwise
$$D_{\mathrm{KL}}(P \,\|\, Q) = \int_\Omega \log\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}\Big)\, \mathrm{d}P,$$
where $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is the Radon–Nikodym derivative, which we can take as the density of $P$ with respect to $Q$.
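When both measures have densities, the Radon–Nikodym derivative is simply the ratio of the two densities, and the defining integral can be estimated by averaging its logarithm over samples from the first measure. A small numerical sanity check with univariate Gaussians (the parameter values are arbitrary and chosen only for illustration):

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    # Closed-form KL divergence D_KL( N(m1, s1^2) || N(m2, s2^2) ).
    return np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

rng = np.random.default_rng(1)
m1, s1, m2, s2 = 1.0, 0.5, 0.0, 1.0

# Monte Carlo estimate of the defining integral: sample from the first
# Gaussian and average the log of the density ratio (the normalising
# constants sqrt(2*pi) cancel in the ratio, so they are omitted).
z = rng.normal(m1, s1, size=200_000)
log_ratio = (-np.log(s1) - (z - m1) ** 2 / (2.0 * s1 ** 2)) \
          - (-np.log(s2) - (z - m2) ** 2 / (2.0 * s2 ** 2))
print(kl_gauss(m1, s1, m2, s2), log_ratio.mean())  # the two values agree closely
```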
We aim to minimize over all and an approximation of
This has the interpretation that we search for and so that it is hard to distinguish the encoder distribution from the decoder distribution .
In view of Bayes’ Theorem given by (A.2), we can write this KL divergence in different ways as follows
(A.3)  
The last of these expressions yields that
The first term in this expression is small when the true distribution is hard to distinguish from the distribution of generated by the decoder . The second term is small when, on average, the conditional distribution of the encoder on given is hard to distinguish from the conditional distribution of the decoder on given .
As usual in variational inference (cf. Blei et al.), we subtract and minimize instead
(A.4)  
This expression can be recognized as being at the start of the derivation for the loss function used in Kingma and Welling (2013). (We assume and in particular that is absolutely continuous with respect to the Lebesgue measure .)
However, the marginal density is often inaccessible, i.e. it is often impossible to compute and hard to approximate. Therefore, one rewrites the functional in a different way. By the representation given in (A.3) we find
Our choice of loss function is therefore
(A.5)  
(A.6)  
which up to scaling and a constant agrees with the loss function used in (4.2).
This derivation allows us to interpret the effects of the different terms and constants in this loss function. The first term in (A.5) can be interpreted as a (negative) loglikelihood, the probability of observing conditioned on the property that . This term is written in detail on the line (A.6), where the Gaussian structure of translates into a squared distance weighted by the factor .
The second term in (A.5) measures the divergence between the conditional distribution and the standard Gaussian.
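Both terms of (A.5)–(A.6) have simple closed forms for a diagonal Gaussian encoder and a Gaussian decoder. The sketch below evaluates them on toy numbers; the variable names and the parameter c (playing the role of the decoder's output variance, i.e. the weighting factor in (A.6)) are our own labels, not notation from the derivation above.

```python
import numpy as np

def vae_loss_terms(x, x_dec, mu, sigma, c):
    # First term of (A.5), written out as in (A.6): squared reconstruction
    # error weighted by 1/(2c), where c plays the role of the decoder's
    # output variance (our own name for the weighting factor).
    recon = np.sum((x - x_dec) ** 2) / (2.0 * c)
    # Second term of (A.5): KL divergence between the diagonal Gaussian
    # encoder distribution N(mu, diag(sigma^2)) and the standard Gaussian,
    # in its well-known closed form.
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))
    return recon, kl

x = np.array([0.5, -1.0, 2.0])       # a toy data vector
x_dec = np.array([0.4, -0.8, 1.5])   # its decoder reconstruction
mu = np.array([0.1, -0.2, 0.0])      # encoder means
sigma = np.array([0.9, 1.1, 1.0])    # encoder standard deviations

recon, kl = vae_loss_terms(x, x_dec, mu, sigma, c=0.1)
print(recon, kl)  # shrinking c inflates the first term relative to the second
```

Note that the KL term vanishes componentwise exactly when mu = 0 and sigma = 1, i.e. when the encoder distribution is already standard Gaussian.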
For very small values of , the first term in (A.5) dominates the second. In practice, this means that for the parameters and found by the optimization procedure, there is no guarantee that the distribution is close to the standard Gaussian measure ; in general it will be far away. Heuristically, the effective range of the encoder will have small measure. For values of that are in the effective range of the encoder, the decoder will produce realistic trajectories. However, for values of that are not in this range, there is no reason for the decoder to produce realistic trajectories. In particular, the generative model that first independently samples and according to and respectively and then computes will have very different statistics from the model that samples from if is very small.
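This mismatch is easy to reproduce in a schematic one-dimensional toy example; the decoder map and the encoder's narrow latent range below are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def decoder(z):
    # Toy fixed decoder: a nonlinear map from a 1-D latent to a 1-D observation.
    return np.tanh(4.0 * z)

# Hypothetical trained encoder whose effective range is a narrow band of
# latent values, far from the standard Gaussian prior.
z_enc = rng.normal(2.0, 0.1, size=100_000)    # encoder's effective range
z_prior = rng.normal(0.0, 1.0, size=100_000)  # where the generative model samples

x_from_enc = decoder(z_enc)      # statistics the decoder was trained to reproduce
x_from_prior = decoder(z_prior)  # statistics of the actual generative model

print(x_from_enc.mean(), x_from_prior.mean())  # clearly different distributions
```

Decoding latents from the encoder's narrow band yields outputs concentrated near one value, while decoding samples from the standard Gaussian prior yields a very different, symmetric output distribution.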
References
 Bruner and Deshpande (2017) J. Bruner and A. Deshpande. Generative adversarial networks for beginners. Retrieved from https://www.oreilly.com/learning/generative-adversarial-networks-for-beginners, 2017.
 Chollet et al. (2015) F. Chollet et al. Keras. https://keras.io, 2015.
 Goff and Carré (2009) J. E. Goff and M. J. Carré. Trajectory analysis of a soccer ball. American Journal of Physics, 77(11):1020–1027, 2009.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680, 2014.
 Helske (2017) J. Helske. KFAS: Exponential family state space models in R. Journal of Statistical Software, 78(10):1–39, 2017.
 Karazeev (2017) A. Karazeev. Generative adversarial networks (GANs): Engine and applications. Retrieved from https://blog.statsbot.co/generative-adversarial-networks-gans-engine-and-applications-f96291965b47, 2017.
 Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114, 2013.
 Koopman and Durbin (2000) S. J. Koopman and J. Durbin. Fast filtering and smoothing for multivariate state space models. Journal of Time Series Analysis, 21(3):281–296, 2000.
 Koopman and Durbin (2003) S. J. Koopman and J. Durbin. Filtering and smoothing of state vector for diffuse state-space models. Journal of Time Series Analysis, 24(1):85–98, 2003.
 Ren et al. (2008) J. Ren, J. Orwell, G. A. Jones, and M. Xu. Real-time modeling of 3D soccer ball trajectories from multiple fixed cameras. IEEE Transactions on Circuits and Systems for Video Technology, 18(3):350–362, 2008.
 Seita (2017) D. Seita. Understanding generative adversarial networks. Retrieved from https://danieltakeshi.github.io/2017/03/05/understanding-generative-adversarial-networks/, 2017.
 Yu et al. (2003a) X. Yu, Q. Tian, and K. W. Wan. A novel ball detection framework for real soccer video. In International Conference on Multimedia and Expo (ICME '03), pages II-265–268, 2003a.
 Yu et al. (2003b) X. Yu, C. Xu, Q. Tian, and H. W. Leong. A ball tracking framework for broadcast soccer video. In International Conference on Multimedia and Expo (ICME '03). IEEE, 2003b. doi: 10.1109/icme.2003.1221606.