1 Introduction
While classification of images and videos with Convolutional Neural Networks (CNNs) is becoming an established practice, unsupervised learning and generative modeling remain challenging problems in deep learning. A successful generative model of a visual process makes it possible to generate sequences of video frames whose appearance and dynamics approximately resemble the original training process without copying it. This procedure is typically referred to as video generation [1, 2] or video synthesis [3]. More technically, this means that in addition to a suitable probability model for the individual frames, a probabilistic description of the frame-to-frame transition is also necessary. Analysis and reproduction of visual processes simplify considerably if this transition can be assumed to be a linear function. For instance, linear transformations are easily invertible, and by means of spectral analysis it can be studied how successive applications of the same transformation behave in the long term.
Unfortunately, the frame transitions of most real-world visual processes are unlikely to be linear functions. Nevertheless, unsupervised learning has produced many approaches that fit linear transition models to real-world processes, for instance by using linear low-rank [4] or sparse approximations of the frames [5], or by applying the kernel trick to them [6].
The success of Generative Adversarial Networks (GANs) [7] and Variational Autoencoders (VAEs) [8] has led to an increased interest in deep generative learning, and it seems natural to apply such techniques to sequential processes. We approach this idea from the perspective of linearization in order to keep the model as simple as possible. Analogously to how physicists transform nonlinear differential equations into linear ones by an appropriate change of variables, our approach is to learn a latent representation of visual processes such that the latent state-to-state transition can be described by a linear model. To this end, we jointly learn a nonlinear observation function and a linear state transition function by means of a modified VAE.
2 Related Work
Similarly to our work, the authors of [9] combine Linear Dynamic Systems (LDSs) with VAEs. However, the focus of their work is on control rather than on synthesis. Furthermore, their model is locally linear, and the transition distribution is modeled in the variational bound, whereas we model it as a separate layer. This is also the main difference to the work in [10], where VAEs are combined with linear dynamic models for forecasting images in video sequences, and to [11], in which VAEs are used as Kalman filters.
The work [12] deals with linearizing transformations under uncertainty via neural networks. It resembles our work in that it also focuses on representation learning rather than on a particular application. However, unlike our work, it does not employ VAEs. Theoretical groundwork regarding learned visual transformations has been laid in [13, 14, 15] and [16]. More generally, the synthesis of video dynamics by means of neural networks has been discussed, among others, in [17] and [18].
Finally, the core contribution of this work is a combination of neural networks with Markov processes. This has been the subject of many works in the recent past. For a broad overview of results in this field, the reader is referred to Chapter 20 of [19].
3 Visual Processes and Linearization
3.1 Dynamic Systems
Dynamic textures [4] have popularized LDSs in the modeling of visual processes. Typically, an LDS is of the following form
(1) $x_{t+1} = A x_t + v_t, \quad y_t = C x_t + \bar{y} + w_t,$

where $x_t \in \mathbb{R}^n$ is the low-dimensional state space variable at time $t$, $A \in \mathbb{R}^{n \times n}$ the state transition matrix, $y_t \in \mathbb{R}^m$ the observation at time $t$ and $C \in \mathbb{R}^{m \times n}$ the observation matrix. The vector $\bar{y} \in \mathbb{R}^m$ represents a constant offset in the observation space. The input terms $v_t$ and $w_t$ are modeled as zero-mean i.i.d. Gaussian noise and are independent of $x_t$. The simplicity of the state transition in the model (1) enables straightforward prediction, generation, and analysis of observations, i.e., synthesis. Real-world visual processes are often highly nonlinear, so it is of great interest to find a model that linearizes the underlying process, such that in some latent state space representation the state transition admits both the linearity and the Gaussianity depicted in Eq. (1). Specifically, this work focuses on the following nonlinear dynamic system model, i.e., a linear state transition and a nonlinear observation mapping
(2) $x_{t+1} = A x_t + v_t, \quad y_t = f(x_t) + w_t,$

where $f$ is assumed to be nonlinear in the rest of the paper. For algorithmic reasons, we assume that $w_t$ is drawn from an isotropic Gaussian distribution, i.e.,

(3) $w_t \sim \mathcal{N}(0, \sigma^2 I).$
Note that the model in Eq. (2) is not unique with respect to changes of basis in the state space [20]. Let $T$ be a full-rank matrix; we define the following substitution

(4) $\tilde{x}_t = T x_t.$

Then the following system is equivalent to (2)

(5) $\tilde{x}_{t+1} = T A T^{-1} \tilde{x}_t + T v_t, \quad y_t = f(T^{-1} \tilde{x}_t) + w_t.$

Specifically, given one visual process described by (2), one can define an equivalent system via the transformations

(6) $\tilde{A} = T A T^{-1}, \quad \tilde{f}(x) = f(T^{-1} x).$
If $f$ is implemented via a neural network, we can ensure that it accounts for a possible change of basis. Therefore, without loss of generality, we propose the following assumption on the latent samples $x_t$.
Assumption 1.
The latent samples are standard normally distributed, i.e.,

(7) $x_t \sim \mathcal{N}(0, I)$ for all $t$.
Remark 1.
If the state transition matrix $A$ is given, and the latent samples are assumed to be stationary, then Assumption 1 essentially identifies the process noise model. Namely, we have

(8) $\mathrm{Cov}(v_t) = \mathrm{Cov}(x_{t+1}) - A\,\mathrm{Cov}(x_t)\,A^\top = I - A A^\top,$

and in order to make sure that the latent states remain Gaussian in sequential synthesis scenarios, i.e., $x_t \sim \mathcal{N}(0, I)$ for all $t$, we just need to ensure that the process noise is zero-mean and has the covariance matrix $I - A A^\top$.
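The stationarity argument of Remark 1 can be verified numerically. The following numpy sketch (illustrative only; the variable names `A`, `Q`, and `Sigma` are chosen here, not taken from the paper) propagates the state covariance of the linear system and confirms that with process noise covariance $I - A A^\top$ it stays the identity.

```python
import numpy as np

# Numerical check of Remark 1: if the process noise covariance is
# Q = I - A A^T (Eq. (8)) and Cov(x_0) = I, then the state covariance
# stays the identity under x_{t+1} = A x_t + v_t.
n = 3
rng = np.random.default_rng(0)

# Any transition matrix with spectral norm < 1 works for this demo.
A = 0.9 * np.linalg.qr(rng.standard_normal((n, n)))[0]  # scaled orthogonal matrix
Q = np.eye(n) - A @ A.T                                 # process noise covariance

Sigma = np.eye(n)  # Cov(x_0) = I by Assumption 1
for _ in range(50):
    Sigma = A @ Sigma @ A.T + Q  # covariance propagation of the linear system

assert np.allclose(Sigma, np.eye(n))  # the latent chain is stationary
```

With any other noise covariance, the loop would converge to a different steady-state covariance, which is why condition (8) matters for sequential synthesis.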
3.2 Linearizability of Nonlinear Visual Transformations
The purpose of this subsection is to justify our aim of learning a linear state-transition model (2) from a conceptual point of view and to provide cues on how to choose a neural network architecture for linearizing visual transformations.
In the model description (2), the observation mapping is modeled by a nonlinear function $f$. In what follows, we aim to show the feasibility of the linear model for the state transitions. Let us consider a visual transformation of observations in $\mathcal{Y} \subset \mathbb{R}^m$ by a map $\varphi: \mathcal{Y} \to \mathcal{Y}$. Such transformations are in general very difficult to model, and hence it is very unlikely to find a global observation mapping such that the respective transformation in the latent space can be modeled exactly by a linear transition. We therefore first formalize the notion of local linearization of a nonlinear self-map.
Definition 1.
Let $\mathcal{Y} \subset \mathbb{R}^m$, let $\varphi: \mathcal{Y} \to \mathcal{Y}$ be a continuous self-map, and let $h$ be a local diffeomorphism at all $y \in \mathcal{Y}$. The map $h$ is said to be a local linearizer of $\varphi$ at $y_0 \in \mathcal{Y}$, if there exists a matrix $M$ such that the following equality holds true with $e(y) = o(\lVert y - y_0 \rVert)$

(9) $h(\varphi(y)) = M h(y) + e(y).$
Here, the map $h$ behaves as a chart of the data manifold $\mathcal{Y}$. If $0 \in \mathcal{Y}$ and $0$ is a fixed point of $\varphi$, then it is obvious that the map $\varphi$ is linearized by the identity at $0$. In general, the map $\varphi$ cannot be guaranteed to have a fixed point. Nevertheless, motivated by Brouwer's fixed-point theorem, we propose the following assumption to ensure local linearizability of $\varphi$.
Assumption 2.
The set $\mathcal{Y}$ is compact and convex, and $\varphi: \mathcal{Y} \to \mathcal{Y}$ is a continuous self-map, i.e., the map $\varphi$ has at least one fixed point $y^\ast \in \mathcal{Y}$.
This assumption is easily justified in image/video processing applications, where images lie in some hypercube, e.g. $[0, 1]^m$. It bears some resemblance to control theory, where linearization of nonlinear dynamical systems is carried out around equilibrium points [21]. The following proposition thus assumes the existence of a fixed point $y^\ast$ in order to characterize neural networks that locally linearize transformations.
Proposition 1.
Let $\mathcal{Y} \subset \mathbb{R}^m$, and let $\varphi: \mathcal{Y} \to \mathcal{Y}$ be a continuous self-map and a local diffeomorphism at all $y \in \mathcal{Y}$. If $y^\ast$ is a fixed point of $\varphi$, then the following map

(10) $h(y) = y - y^\ast$

locally linearizes $\varphi$ at $y^\ast$.
Proof.
Since $\varphi$ is a local diffeomorphism, it is differentiable. We denote by $J$ the Jacobian matrix of $\varphi$ at $y^\ast$, and Taylor's theorem yields

(11) $\varphi(y) = \varphi(y^\ast) + J (y - y^\ast) + e(y)$, with $e(y) = o(\lVert y - y^\ast \rVert)$.

Knowing that $y^\ast$ is a fixed point of $\varphi$, we can rewrite the expression by substituting $\varphi(y^\ast) = y^\ast$ as

(12) $\varphi(y) = y^\ast + J (y - y^\ast) + e(y).$

We define $h(y) = y - y^\ast$ and substitute

(13) $y = h^{-1}(x) = x + y^\ast$

into (12). This yields

(14) $h(\varphi(y)) = \varphi(y) - y^\ast = J h(y) + e(y),$

which finalizes the proof. ∎
The error term in (11) is driven by the curvature of $\varphi$ around $y^\ast$. Incidentally, the authors of [12] also achieve linearization by penalizing curvature.
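The second-order behavior of the error term can be illustrated with a toy example. The map below (the componentwise square, a hypothetical choice not taken from the paper) is a continuous self-map of $[0,1]^2$ with fixed point $(1,1)$, and shifting by the fixed point linearizes it locally, with an error that decays quadratically.

```python
import numpy as np

# Numerical illustration of Proposition 1 on a toy map: phi(y) = y * y
# maps [0, 1]^2 to itself and has the fixed point y* = (1, 1). With
# h(y) = y - y* and M the Jacobian of phi at y*, the linearization error
# ||h(phi(y)) - M h(y)|| is of second order in the distance to y*.
y_star = np.array([1.0, 1.0])
phi = lambda y: y * y
M = np.diag(2.0 * y_star)  # Jacobian of phi at the fixed point

def lin_error(eps):
    y = y_star - eps                      # a point at distance ~eps from y*
    return np.linalg.norm(phi(y) - y_star - M @ (y - y_star))

# Shrinking the distance by a factor of 10 shrinks the error by ~100.
ratio = lin_error(0.1) / lin_error(0.01)
assert abs(ratio - 100.0) < 1e-6
```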
Essentially, Proposition 1 suggests including a bias in the first layer of a linearizing neural network that tries to implement $h$. In general, however, this is not enough to achieve a low linearization error globally. In fact, a neural network consisting of one single affine layer suffices to locally linearize an appropriate transformation.

3.3 Linearization via CNNs
In this subsection, we discuss several additional heuristics for linearization and argue that employing convolutional layers is a suitable choice.
We start by observing that CNNs are capable of representing data in a way that is almost invariant to certain classes of transformations of the data [22, 23]. In other words, a transformation applied to a data sample does not greatly displace its representation. For illustration purposes, let $y$ denote an image depicting an object, and $\varphi(y)$ an image depicting the same object, deformed by applying certain forces to it. Due to the curse of dimensionality, the application of $\varphi$ can lead to a significant displacement of the pixel representation in the Euclidean space. However, the analysis of simplified CNNs with fixed filter weights and absolute value or ReLU activation functions, so-called Scattering transforms, has shown that it is possible to find a representation that is contracting with respect to spatial deformations of images. More specifically, in [22] a deformation is described as a warping of spatial coordinates. For such deformations, a contraction constant $C_\varphi$ was derived such that

(15) $\lVert f(\varphi(y)) - f(y) \rVert \le C_\varphi \lVert y \rVert$
holds, if $f$ is implemented by a Scattering transform. Even though the discussion in [22] is limited to deformations, it is generally assumed in [24] that approximately invariant representations with respect to much broader classes of transformations can be learned by CNNs. The smaller the contraction constant $C_\varphi$, the more regularity is introduced to the data with respect to the linearizability of $\varphi$. To see this, we introduce the minimal expected linearization error $\ell(\varphi)$, which measures how well a transformation can be modeled by a multiplication with a matrix $M$ as follows

(16) $\ell(\varphi) = \min_M \mathbb{E}\left[ \lVert f(\varphi(y)) - M f(y) \rVert^2 \right].$

The following inequality easily follows by choosing $M = I$

(17) $\ell(\varphi) \le C_\varphi^2\, \mathbb{E}\left[ \lVert y \rVert^2 \right].$

Note, however, that the measure $\ell$ does not account for how much $f$ expands or shrinks its input. The contraction constant $C_\varphi$, for instance, was derived for approximately norm-preserving functions [22].
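The minimal expected linearization error can be estimated from samples, since the minimization over $M$ is an ordinary least-squares problem. The sketch below uses stand-ins of our own choosing for the representation and the transformation (`f` as an elementwise tanh, `phi` as a cyclic feature shift); it only illustrates the estimation procedure, not the paper's experiments.

```python
import numpy as np

# Sample-based estimate of the minimal expected linearization error in
# Eq. (16): the minimizing matrix M is a least-squares fit of f(phi(y))
# against f(y). Both f and phi are illustrative stand-ins.
rng = np.random.default_rng(1)
f = np.tanh                        # stand-in representation
phi = lambda y: np.roll(y, 1, 1)   # stand-in transformation (cyclic shift)

Y = rng.standard_normal((500, 8))             # samples of y
F, G = f(Y), f(phi(Y))                        # representations before/after
M, *_ = np.linalg.lstsq(F, G, rcond=None)     # minimizer of E||f(phi(y)) - M f(y)||^2

err_opt = np.mean(np.sum((G - F @ M) ** 2, 1))
err_id = np.mean(np.sum((G - F) ** 2, 1))     # the choice M = I behind Eq. (17)
assert err_opt <= err_id + 1e-12              # least squares can only do better
```

In this particular toy setup, the elementwise `f` commutes with the permutation `phi`, so the optimal `M` is a permutation matrix and the fitted error is essentially zero; the transformation is perfectly linearizable in the representation.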
To summarize, we can hope to linearize broad classes of transformations, given the right neural network architecture. In particular, the preceding discussion suggests autoencoders with (almost everywhere) differentiable activation functions to account for the diffeomorphism property in Proposition 1, and input layers with a bias to account for (10). Due to the contraction properties expected of CNNs, it is natural to employ convolutional layers and ReLU activations for both the encoder and the decoder.
Until now, we have discussed heuristics for the choice of architecture, but the advantage of neural networks is that we can make design goals like linearizability explicit by formulating an appropriate loss function. We tackle this problem from a stochastic perspective by constraining the joint probability distribution of succeeding samples in the latent space.

4 Variational Autoencoders for Sequences
4.1 Variational Autoencoders: Review
According to Assumption 1, the observation mapping $f$ transforms a standard normal distribution into the observation distribution, where for a latent sample $x$, the expected observation is $f(x)$ and the corresponding conditional probability distribution $p(y \mid x)$ is given by the noise model (3). Conveniently, VAEs provide a framework to do just that. Let $X \sim \mathcal{N}(0, I)$ be a standard normally distributed random variable. Given a set $\{y^{(1)}, \dots, y^{(N)}\}$ of realizations of a random variable $Y$, the objective of the VAE is to maximize the log-likelihood function

(18) $\sum_{i=1}^{N} \log p_\theta(y^{(i)})$

by learning a parametrized function $f_\theta$ that approximately transforms $X$ into $Y$. According to (3), we fix the following assumption

(19) $p_\theta(y \mid x) = \mathcal{N}\left(y;\, f_\theta(x),\, \sigma^2 I\right).$
Then, applying the expectation yields

(20) $p_\theta(y) = \mathbb{E}_{x \sim \mathcal{N}(0, I)}\left[ p_\theta(y \mid x) \right].$

The parameter $\theta$ should thus maximize the term $\sum_i \log \mathbb{E}_x\left[ p_\theta(y^{(i)} \mid x) \right]$.
However, directly maximizing the expected value in (20) by standard Monte Carlo methods is infeasible for computational reasons [25]. Luckily, variational inference provides a lower bound for the likelihood function that can be optimized by stochastic gradient descent. Let $e_\phi$ be a parametrized, measurable function which maps from the observation space and a noise space to the codomain of $X$. Let the random variable

(21) $\hat{X} = e_\phi(Y, \xi)$, with $\xi \sim \mathcal{N}(0, I)$,

have the conditional probability density function $q_\phi(x \mid y)$. Let us consider the expression

(22) $\log p_\theta(y) - D_{\mathrm{KL}}\left( q_\phi(x \mid y) \,\Vert\, p_\theta(x \mid y) \right),$

with $D_{\mathrm{KL}}$ denoting the Kullback-Leibler Divergence (KLD). Since the KLD is always non-negative, the following inequality holds true

(23) $\log p_\theta(y) \ge \log p_\theta(y) - D_{\mathrm{KL}}\left( q_\phi(x \mid y) \,\Vert\, p_\theta(x \mid y) \right).$
We can rewrite the KLD as an expected value. Since $\log p_\theta(y)$ is not a random variable, it is not affected by the expectation, and we reformulate (22) as

(24) $\log p_\theta(y) - D_{\mathrm{KL}}\left( q_\phi(x \mid y) \,\Vert\, p_\theta(x \mid y) \right) = \mathbb{E}_{x \sim q_\phi}\left[ \log p_\theta(y \mid x) \right] - D_{\mathrm{KL}}\left( q_\phi(x \mid y) \,\Vert\, p(x) \right).$

As a consequence, the lower bound of $\log p_\theta(y)$ can be maximized by minimizing

(25) $\mathcal{L}(\theta, \phi) = \frac{1}{2\sigma^2}\, \mathbb{E}_{x \sim q_\phi}\left[ \lVert y - f_\theta(x) \rVert^2 \right] + D_{\mathrm{KL}}\left( q_\phi(x \mid y) \,\Vert\, p(x) \right).$
It is then straightforward to compute the gradient of the squared norm in $\mathcal{L}$. To estimate the expected value, we draw one sample $y$ from the training set and several samples from $q_\phi(x \mid y)$ by applying $e_\phi$ to samples of standard normal noise. This is known as the reparametrization trick. Slightly more elaborate is the KLD term in $\mathcal{L}$. Since the distribution $q_\phi(x \mid y)$ depends on $\phi$, in order to make the task computationally tractable, $e_\phi$ is modeled as an affine function of the form

(26) $e_\phi(y, \xi) = \mu_\phi(y) + \mathrm{diag}(\sigma_\phi(y))\, \xi.$

In technical terms, this means the encoder part of Fig. 1 is a subnetwork that maps a training sample $y$ to the two vectors $\mu_\phi(y)$ and $\sigma_\phi(y)$. The random variable $\hat{X}$ is thus described by the distribution

(27) $q_\phi(x \mid y) = \mathcal{N}\left(x;\, \mu_\phi(y),\, \mathrm{diag}(\sigma_\phi(y))^2\right),$
and the KLD in (25) can be written in closed form as

(28) $D_{\mathrm{KL}}\left( q_\phi(x \mid y) \,\Vert\, p(x) \right) = \frac{1}{2} \sum_{j=1}^{n} \left( \sigma_{\phi,j}(y)^2 + \mu_{\phi,j}(y)^2 - 1 - \log \sigma_{\phi,j}(y)^2 \right).$

In this way, stochastic gradient descent on (25) can be applied via backpropagation.
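The two ingredients of the loss can be sketched compactly. The following numpy fragment (a minimal illustration, not the paper's PyTorch code; all function names are our own) combines the reparametrized reconstruction term of (25) with the closed-form Gaussian KLD of (28).

```python
import numpy as np

rng = np.random.default_rng(0)

def kld_std_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), cf. Eq. (28)
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

def vae_loss(y, decode, mu, sigma, obs_var, n_samples=8):
    # Reparametrization trick (Eq. (26)): x = mu + sigma * xi, xi ~ N(0, I)
    xi = rng.standard_normal((n_samples, mu.size))
    x = mu + sigma * xi
    rec = np.mean(np.sum((y - decode(x)) ** 2, 1)) / (2.0 * obs_var)
    return rec + kld_std_normal(mu, sigma)

# The KLD vanishes exactly when the encoder outputs the prior.
assert np.isclose(kld_std_normal(np.zeros(4), np.ones(4)), 0.0)
```

In the actual network, `mu` and `sigma` would be the encoder outputs for a training sample and `decode` the decoder subnetwork, with gradients of both terms flowing back through the encoder parameters.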
4.2 Markov Assumptions
We want to model a sequential, stochastic visual process (2) such that the observation mapping $f$ is performed by the decoder part of a VAE. Let us assume that we are given a sequence $y_1, \dots, y_T$ of vectorized video frames. Hereby, $f$ is carried out by a neural network described by the trainable parameter tuple $\theta$. If we neglect the temporal order of the frames, we can theoretically train a VAE to generate frames similar to $y_1, \dots, y_T$, because the latent variables are from the standard normal distribution. However, crucial to synthesizing a visual process is not only the capability to create still-image frames, but also to create them according to a temporal model. First and foremost, this implies the possibility to infer $A$ in addition to $\theta$. The easiest way to approach this is by first learning $\theta$ by training the VAE and then inferring $A$ via squared-error minimization as

(29) $\hat{A} = \arg\min_{A} \sum_{t=1}^{T-1} \lVert x_{t+1} - A x_t \rVert^2.$
Such an approach clearly has its advantages in terms of simplicity, but given the high capacity of trainable neural networks, it is more elegant to learn $\theta$ and $A$ simultaneously. By doing so, we force the latent variables to fit a linear transition model already during the training process, instead of fitting a linear state transition model to a sequence of already learned latent variables.
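The two-stage alternative around Eq. (29) amounts to a linear least-squares fit. The numpy sketch below (illustrative; the latent states are simulated here from a known transition rather than produced by an encoder) recovers the transition matrix from a latent sequence.

```python
import numpy as np

# Least-squares estimate of A as in Eq. (29): stack the transitions
# x_{t+1} ~ A x_t row-wise and solve X[1:] = X[:-1] A^T for A.
rng = np.random.default_rng(2)
n, T = 3, 20
A_true = 0.8 * np.linalg.qr(rng.standard_normal((n, n)))[0]

X = np.zeros((T, n))
X[0] = rng.standard_normal(n)
for t in range(T - 1):
    X[t + 1] = A_true @ X[t]        # noise-free latent sequence for the demo

A_hat = np.linalg.lstsq(X[:-1], X[1:], rcond=None)[0].T
assert np.allclose(A_hat, A_true)   # exact recovery in the noise-free case
```

With noisy or learned latent states, the same solver returns the minimizer of the squared transition error instead of an exact recovery.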
The temporal model at hand is a first-order Markov process. Initially, the data needs to be adapted to the problem. We thus formulate our problem setting as a generative model for observations of the form

(30) $\hat{y}_t = \begin{bmatrix} y_t \\ y_{t+1} \end{bmatrix},$

each of which contains two succeeding frames.
A sample $\hat{y}$ is a realization of the random variable $\hat{Y}$ and is composed of two subvectors with the same statistical properties. This means that the distributions of the upper and the lower half subvector of $\hat{y}$, i.e., of the current and the predicted frame, must be identical. Specifically, following the discussion of Section 4.1, we assume that $\hat{Y}$ is driven by a latent variable $\hat{x} = [x_1^\top, x_2^\top]^\top$ and that the conditional distribution has the form

(31) $p_\theta(\hat{y} \mid \hat{x}) = \mathcal{N}\left( \hat{y};\, \begin{bmatrix} f_\theta(x_1) \\ f_\theta(x_2) \end{bmatrix},\, \sigma^2 I \right),$

where $\sigma^2$ denotes the variance of the observation noise $w_t$ in (2). The subvectors $x_1, x_2$ stand for the latent variables, i.e., the state space vectors, belonging to the upper and lower half of $\hat{y}$. As agreed on before, their marginal distribution is standard normal. However, their joint distribution is not, since the choice of $x_2$ depends on $x_1$. In fact, from the previous section, we can deduce the joint probability distribution as

(32) $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim \mathcal{N}\left( 0,\, \begin{bmatrix} I & A^\top \\ A & I \end{bmatrix} \right).$
This contradicts the premise of the VAE, which models latent variables by standard normal distributions. However, if we adapt the model of the classical VAE such that the decoder part in Fig. 1 is fed with samples drawn from the distribution (32), and make the parameter $A$ trainable, we can simultaneously learn the observation mapping and the dynamic state transition of a visual process.
4.3 A Dynamic VAE
In this section, we propose a neural network architecture that produces samples similar to $\hat{y}$ from realizations of the distribution (32). We achieve this by modeling the linear dynamics with an additional layer between the latent space layer and the decoder. We refer to this layer as the dynamic layer and to the architecture in its entirety as a Dynamic VAE. The purpose of the dynamic layer is to map the random variable $\hat{Z} = [Z_1^\top, Z_2^\top]^\top$, which has a standard normal distribution, to a random variable $\hat{X}$ which has the distribution indicated in (32). Let us denote by $z_1, z_2$ the upper and lower half of $\hat{z}$. Then such a mapping can be achieved by a function of the form

(33) $d(z_1, z_2) = \begin{bmatrix} z_1 \\ A z_1 + B z_2 \end{bmatrix},$

where $B$ is a matrix such that $B B^\top = I - A A^\top$. Fig. 2 depicts the resulting architecture.
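That the dynamic layer produces exactly the joint distribution (32) can be checked analytically: the layer is linear, so its output covariance under standard normal input is the Gram matrix of the stacked transform. A numpy sketch (variable names of our own choosing):

```python
import numpy as np

# Check of the dynamic layer, Eq. (33): d(z1, z2) = (z1, A z1 + B z2)
# with B B^T = I - A A^T turns standard normal noise into the joint
# latent covariance of Eq. (32).
n = 3
rng = np.random.default_rng(3)
A = 0.7 * np.linalg.qr(rng.standard_normal((n, n)))[0]
B = np.linalg.cholesky(np.eye(n) - A @ A.T)   # any B with B B^T = I - A A^T

# The layer as one linear map D acting on (z1, z2); since Cov(z) = I,
# the output covariance is D D^T.
D = np.block([[np.eye(n), np.zeros((n, n))], [A, B]])
cov = D @ D.T

expected = np.block([[np.eye(n), A.T], [A, np.eye(n)]])
assert np.allclose(cov, expected)
```

The Cholesky factorization exists whenever $I - A A^\top$ is positive definite, i.e. whenever the spectral norm of $A$ is below one, which is also the stationarity condition discussed above.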
In order to guarantee stationarity, we need to ensure that condition (8) is satisfied. This can be done by including a regularizer. The loss function of the Dynamic VAE with parameters $\theta$, $\phi$, $A$, and $B$ is thus defined as

(34) $\mathcal{L}_{\mathrm{DVAE}}(\theta, \phi, A, B) = \mathcal{L}(\theta, \phi) + \lambda \left\lVert A A^\top + B B^\top - I \right\rVert_F^2,$

where $\lambda$ should be chosen high enough to keep the regularizer close to $0$. The KLD term depends on $\phi$ via (28); note, however, that $y$ is to be replaced by $\hat{y}$ in this context.
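A minimal sketch of such a stationarity regularizer, under the assumption that both $A$ and $B$ of the dynamic layer are trainable (the function name is our own): it vanishes exactly when $B B^\top = I - A A^\top$, i.e. when condition (8) holds.

```python
import numpy as np

def stationarity_penalty(A, B):
    # Squared Frobenius deviation from A A^T + B B^T = I, cf. condition (8)
    n = A.shape[0]
    return np.sum((A @ A.T + B @ B.T - np.eye(n)) ** 2)

A = 0.5 * np.eye(2)
B = np.sqrt(0.75) * np.eye(2)      # B B^T = 0.75 I = I - A A^T
assert np.isclose(stationarity_penalty(A, B), 0.0)
assert stationarity_penalty(A, 2 * B) > 0.0   # violated condition is penalized
```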
5 Experiments
5.1 Overview
The experiments treated three different kinds of visual processes. In each experiment, the Dynamic VAE was trained with a sequence of frames. Afterwards, each sequence was generated from the trained model. Latent states were synthesized according to the rule

(35) $x_{t+1} = A x_t + B z_t, \quad z_t \sim \mathcal{N}(0, I),$

where the $z_t$ were drawn from a standard normal distribution and the initial state $x_0$ was inferred from the expected value of the conditional latent distribution of a test frame pair. The frame pair was excluded from the training set. This was done in order to improve the significance of the experimental outcome with respect to how well the model generalizes.
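The synthesis procedure can be sketched as a short loop. The decoder and the inferred initial state below are illustrative stand-ins (not the trained networks of the paper); the loop itself is the rule of Eq. (35).

```python
import numpy as np

# Sequential synthesis as in Eq. (35): evolve the latent state with
# x_{t+1} = A x_t + B z_t and decode each state into a frame.
rng = np.random.default_rng(4)
n, T = 4, 10
A = 0.8 * np.linalg.qr(rng.standard_normal((n, n)))[0]
B = np.linalg.cholesky(np.eye(n) - A @ A.T)
decode = lambda x: np.outer(x, x).ravel()   # stand-in for the trained decoder

x = rng.standard_normal(n)                  # stand-in for the inferred x0
frames = []
for _ in range(T):
    frames.append(decode(x))
    x = A @ x + B @ rng.standard_normal(n)  # Eq. (35)

assert len(frames) == T and frames[0].shape == (n * n,)
```

In the actual pipeline, `x` would be initialized from the encoder mean of a held-out frame pair and `decode` would be the trained decoder subnetwork.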
The observer neural network $f_\theta$ was implemented via a fully-connected layer followed by three convolutional layers with ReLU activations and nearest-neighbor upsampling. The number of channels was decreased with each layer by a factor of four, such that the number of pixels in each hidden layer remains roughly unchanged. The encoder mirrored this structure with the same number of convolutional layers, ReLU activations, an increasing number of channels, and max-pooling layers. The same filter size was used for each convolutional layer. All experiments were implemented in Python 3.6 with PyTorch 0.1.12 on CUDA 8.0. The choice of parameters for each experiment is described in Table 1. The code is publicly available [26].

Experiment | Filter size | $\sigma^2$ | $\lambda$
MNIST | — | 1.5 | 100.0
UCLA50 | — | 0.31 | 100.0
NORB | — | 5.0 | 100.0
Evaluating generative models is particularly challenging. This is due to the very nature of the problem, which demands measuring the similarity of the probability distribution underlying the training data to the probability distribution that generated the test data. Neither of the two is available in closed form; both can only be estimated from a limited number of samples in a very high-dimensional space. It is thus an established practice to evaluate generative models by visual inspection of the generated samples [19]. However, it is important to consider overfitting, which can lead to supposedly very realistic samples. The following experiments have the purpose of demonstrating the principal capability of the proposed method to infer a linear model from a highly nonlinear process by generating sequences from Gaussian noise. We therefore acknowledge that our choice of hyperparameters is possibly suboptimal and that architectures optimized for a specific task could lead to visually more appealing results. Due to space constraints, only a few experimental outcomes are shown in each subsection. The supplementary material to this paper contains synthesis results for each performed experiment.
5.2 Learning to Count
In the first series of experiments, we trained our architecture with sequences of images from the MNIST data set. One sequence was used for each experiment. The aim was to learn a generative, sequential model that can produce repeating sequences of numbers. For instance, in the first experiment, the frame transition to be linearized was a mapping of a 1 to a 2, a 2 to a 3, a 3 to a 4, and a 4 to a 1. Each training sequence contained 7999 MNIST image pairs.
Fig. 3 visualizes the synthesis of the sequences 12341234… and 67896789… in comparison to the result of a purely linear model as described in [4]. The Dynamic VAE did well in synthesizing number sequences of length 4 or smaller. Longer sequences were more challenging, as Fig. 4 shows. While some sequences could be trained sufficiently well, others appeared to yield non-stationary systems or were too unpredictable for the Dynamic VAE.
5.3 Dynamic Textures
The second series of experiments focused on the synthesis of dynamic textures. In each experiment, the Dynamic VAE was trained with one class of dynamic texture from the cropped UCLA50 database [27, 28].
Fig. 5 depicts the synthesis results for the dynamic texture wfallsc. In general, we observed that the synthesis of predictable sequences, e.g., oscillations or cyclic phenomena, produces realistic results. Chaotic textures yielded some frames that looked artificial. This could be observed, for instance, in the synthesis of the candle dynamic texture.
5.4 Rotating Objects
The Small NORB [29] dataset consists of pictures taken of different miniature objects under varying lighting conditions, elevation angles, and azimuthal angles. One object at a time was used for training. We trained our model to linearize a counterclockwise azimuthal rotation by a fixed angle. Since the Small NORB dataset contains little variability apart from the intentional one, we decided to exclude one configuration of lighting conditions and elevation angle from the training data and use the contained sequence of azimuthal positions as ground truth for our experiment. Generally, the rotation could be reproduced well by the linear state transition model, except for category 1, which contains human figures. Fig. 6 depicts the Dynamic VAE synthesis of a rotating horse compared to a linear synthesis [4].
The synthesized rotation angle is slightly larger than the ground-truth angle, since the columns are not aligned. The model seems to be confused by diametrical angles. For instance, at certain positions in Fig. 6, it becomes indeterminable whether the horse faces towards or away from the observer, leading to the skipping of subsequent rotation steps.
6 Conclusion
This work presented an approach to infer linear models of visual processes by means of Variational Autoencoders. To this end, the classical VAE model was modified to include an additional layer that models the latent dynamics of the visual process. The capability of the proposed model was demonstrated in three series of synthesis experiments. Additionally, the aim of this work was to develop a notion of linearizability and its implications for the choice of neural network architectures. While this yields first conceptual results, we understand that the theoretical analysis of this matter has room for improvement. Therefore, in future work, we plan to gain further insights into the theoretical concept of linearizability, but also to improve the architecture to handle more complex data.
References

[1] Vondrick, C., Torralba, A.: Generating the future with adversarial transformers. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
 [2] De Souza, C., Gaidon, A., Cabon, Y., Lopez Pena, A.: Procedural generation of videos to train deep action recognition networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
 [3] Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: International Conference on Computer Vision (ICCV). Volume 2. (2017)
 [4] Doretto, G., Chiuso, A., Wu, Y.N., Soatto, S.: Dynamic textures. International Journal of Computer Vision 51(2) (2003) 91–109
 [5] Wei, X., Li, Y., Shen, H., Chen, F., Kleinsteuber, M., Wang, Z.: Dynamical textures modeling via joint video dictionary learning. IEEE Transactions on Image Processing 26(6) (2017) 2929–2943
 [6] Chan, A.B., Vasconcelos, N.: Classifying video with kernel dynamic textures. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE (2007) 1–6
 [7] Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
 [8] Kingma, D.P., Welling, M.: Autoencoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
 [9] Watter, M., Springenberg, J., Boedecker, J., Riedmiller, M.: Embed to control: A locally linear latent dynamics model for control from raw images. In: Advances in neural information processing systems. (2015) 2746–2754
 [10] Johnson, M., Duvenaud, D.K., Wiltschko, A., Adams, R.P., Datta, S.R.: Composing graphical models with neural networks for structured representations and fast inference. In: Advances in neural information processing systems. (2016) 2946–2954
 [11] Krishnan, R.G., Shalit, U., Sontag, D.: Deep kalman filters. arXiv preprint arXiv:1511.05121 (2015)
 [12] Goroshin, R., Mathieu, M.F., LeCun, Y.: Learning to linearize under uncertainty. In: Advances in Neural Information Processing Systems. (2015) 1234–1242
 [13] Cohen, T.S., Welling, M.: Transformation properties of learned visual representations. arXiv preprint arXiv:1412.7659 (2014)

[14] Cohen, T., Welling, M.: Group equivariant convolutional networks. In: International Conference on Machine Learning. (2016) 2990–2999
 [15] Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming autoencoders. In: International Conference on Artificial Neural Networks, Springer (2011) 44–51
 [16] Memisevic, R.: Learning to relate images. IEEE transactions on pattern analysis and machine intelligence 35(8) (2013) 1829–1846
 [17] Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Advances In Neural Information Processing Systems. (2016) 613–621
 [18] Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: Advances in Neural Information Processing Systems. (2016) 91–99
[19] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. Volume 1. MIT Press, Cambridge (2016)
 [20] Afsari, B., Vidal, R.: The alignment distance on spaces of linear dynamical systems. In: Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, IEEE (2013) 1162–1167
 [21] Perko, L.: Differential equations and dynamical systems. Volume 7. Springer Science & Business Media (2013)
 [22] Mallat, S.: Group invariant scattering. Communications on Pure and Applied Mathematics 65(10) (2012) 1331–1398

[23] Wiatowski, T., Bölcskei, H.: A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory (2017)
 [24] Mallat, S.: Understanding deep convolutional networks. Phil. Trans. R. Soc. A 374(2065) (2016) 20150203
 [25] Doersch, C.: Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016)
 [26] Sagel, A.: Gitlab repository: https://gitlab.lrz.de/ga68biq/dynamicvae
 [27] Saisan, P., Doretto, G., Wu, Y.N., Soatto, S.: Dynamic texture recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2., IEEE (2001) II–II
 [28] Chan, A.B., Vasconcelos, N.: Probabilistic kernels for the classification of autoregressive visual processes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 1. (June 2005) 846–851 vol. 1
 [29] LeCun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recognition with invariance to pose and lighting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Volume 2., IEEE (2004) II–104