1.1 The model
Most physical phenomena in our visual environments are spatial-temporal processes. In this paper, we study a generative model for spatial-temporal processes such as dynamic textures and action sequences in video data. The model is a non-linear generalization of the linear state space model proposed by [Doretto et al.2003] for dynamic textures. The model of [Doretto et al.2003] is a hidden Markov model, which consists of a transition model that governs the transition probability distribution in the state space, and an emission model that generates the observed signal by a mapping from the state space to the signal space. In the model of [Doretto et al.2003], the transition model is an auto-regressive model in the $d$-dimensional state space, and the emission model is a linear mapping from the $d$-dimensional state vector to the $D$-dimensional image. In [Doretto et al.2003], the emission model is learned by treating all the frames of the input video sequence as independent observations, and the linear mapping is learned by principal component analysis via singular value decomposition. This reduces the $D$-dimensional image to a $d$-dimensional state vector. The transition model is then learned on the sequence of $d$-dimensional state vectors by a first order linear auto-regressive model.
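The two-stage linear procedure described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation of [Doretto et al.2003]: the function name `fit_linear_dynamic_texture`, the subtraction of the temporal mean, and the least-squares fit of the auto-regressive matrix are choices made here for the sketch.

```python
import numpy as np

def fit_linear_dynamic_texture(frames, d):
    """Fit a linear emission model (PCA via SVD) and a first-order AR transition.

    frames: array of shape (T, D), one flattened frame per row.
    d: state dimension, with d <= min(T, D).
    """
    mean = frames.mean(axis=0)            # temporal mean (assumption: it is subtracted)
    centered = frames - mean
    # Emission: PCA via SVD, treating the frames as independent observations.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    C = vt[:d].T                          # (D, d) linear map with orthonormal columns
    states = centered @ C                 # (T, d) state vectors
    # Transition: least-squares fit of states[t + 1] ~ A @ states[t].
    A_T, *_ = np.linalg.lstsq(states[:-1], states[1:], rcond=None)
    A = A_T.T
    residual = states[1:] - states[:-1] @ A.T   # estimated innovation vectors
    return mean, C, states, A, residual
```

Note that the emission map `C` is fixed before the transition matrix `A` is estimated, so the two components are not learned jointly.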
Given the high approximation capacity of modern deep neural networks, it is natural to replace the linear structures in the transition and emission models of [Doretto et al.2003] by neural networks. This leads to a dynamic generator model with the following two components. (1) The emission model, which is a generator network that maps the $d$-dimensional state vector to the $D$-dimensional image via a top-down deconvolution network. (2) The transition model, where the state vector of the next frame is obtained by a non-linear transformation of the state vector of the current frame as well as an independent Gaussian white noise vector that provides randomness in the transition. The non-linear transformation can be parametrized by a feedforward neural network or multi-layer perceptron. In this model, the latent random vectors that generate the observed data are the independent Gaussian noise vectors, also called innovation vectors in [Doretto et al.2003]. The state vectors and the images can be deterministically computed from these noise vectors.
1.2 The learning algorithm
Such dynamic models have been studied in the computer vision literature recently, notably [Tulyakov et al.2017]. However, the models are usually trained as generative adversarial networks (GAN) [Goodfellow et al.2014], with an extra discriminator network that seeks to distinguish between the observed data and the synthesized data generated by the dynamic model. Such a model may also be learned as a variational auto-encoder (VAE) [Kingma and Welling2014], together with an inference model that infers the sequence of noise vectors from the sequence of observed frames. Such an inference model may require a sophisticated design.
In this paper, we show that it is possible to learn the model on its own using an alternating back-propagation through time (ABPTT) algorithm, without recruiting a separate discriminator model or an inference model. The ABPTT algorithm iterates the following two steps. (1) Inferential back-propagation through time, which samples the sequence of noise vectors given the observed video sequence using the Langevin dynamics, where the gradient of the log posterior distribution of the noise vectors can be calculated by back-propagation through time. (2) Learning back-propagation through time, which updates the parameters of the transition model and the emission model by gradient ascent, where the gradient of the log-likelihood with respect to the model parameters can again be calculated by back-propagation through time.
The alternating back-propagation (ABP) algorithm was originally proposed for the static generator network [Han et al.2017]. In this paper, we show that it can be generalized to the dynamic generator model. In our experiments, we show that we can learn the dynamic generator models using the ABPTT algorithm for dynamic textures and action sequences.
Two advantages of the ABPTT algorithm for the dynamic generator models are convenience and efficiency. The algorithm can be implemented without designing an extra network. Because it only involves back-propagation through time with respect to a single model, it avoids the computational cost of training an additional network.
1.3 Related work
The proposed learning method is related to the following themes of research.
Dynamic textures. The original dynamic texture model [Doretto et al.2003] is linear in both the transition model and the emission model. Our work is concerned with a dynamic model with non-linear transition and emission models. See also [Tesfaldet, Brubaker, and Derpanis2018] and references therein for some recent work on dynamic textures.
Chaos modeling. The non-linear dynamic generator model has been used to approximate chaos in a recent paper [Pathak et al.2017]. In the chaos model, the innovation vectors are given as inputs, and the model is deterministic. In contrast, in the model studied in this paper, the innovation vectors are independent Gaussian noise vectors, and the model is stochastic.
GAN and VAE. The dynamic generator model can also be learned by GAN or VAE. See [Tulyakov et al.2017], [Saito, Matsumoto, and Saito2017], and [Vondrick, Pirsiavash, and Torralba2016] for recent video generative models based on GAN. However, GAN does not infer the latent noise vectors. In VAE [Kingma and Welling2014], one needs to design an inference model for the sequence of noise vectors, which is a non-trivial task due to the complex dependency structure. Our method does not require an extra model such as a discriminator in GAN or an inference model in VAE.
Models based on spatial-temporal filters or kernels. The patterns in the video data can also be modeled by spatial-temporal filters by treating the data as 3D (2 spatial dimensions and 1 temporal dimension), such as a 3D energy-based model [Xie, Zhu, and Wu2017] where the energy function is parametrized by a 3D bottom-up ConvNet, or a 3D generator model [Han et al.2019] where a top-down 3D ConvNet maps a latent random vector to the observed video data. Such models do not have a dynamic structure defined by a transition model, and they are not convenient for predicting future frames.
The main contribution of this paper lies in the combination of the dynamic generator model and the alternating back-propagation through time algorithm. Both the model and the algorithm are simple and natural, and their combination can be very useful for modeling and analyzing spatial-temporal processes. The model is one-piece in the sense that (1) the transition model and emission model are integrated into a single latent variable model; (2) the learning of the dynamic model is end-to-end, which is different from [Han et al.2017]’s treatment; and (3) the learning of our model does not need to recruit a discriminative network (as in GAN) or an inference network (as in VAE), which makes our method simple and efficient in terms of computational cost and number of model parameters.
2 Model and learning algorithm
2.1 Dynamic generator model
Let $X = (x_t, t = 1, ..., T)$ be the observed video sequence, where $x_t$ is a frame at time $t$. The dynamic generator model consists of the following two components:
$$s_t = F_\alpha(s_{t-1}, \xi_t), \tag{1}$$
$$x_t = G_\beta(s_t) + \epsilon_t, \tag{2}$$
where $t = 1, ..., T$. (1) is the transition model, and (2) is the emission model. $s_t$ is the $d$-dimensional hidden state vector. $\xi_t \sim \mathrm{N}(0, I)$ is the noise vector of a certain dimensionality. The Gaussian noise vectors $(\xi_t, t = 1, ..., T)$ are independent of each other. The sequence of $(s_t, t = 1, ..., T)$ follows a non-linear auto-regressive model, where the noise vector $\xi_t$ encodes the randomness in the transition from $s_{t-1}$ to $s_t$ in the $d$-dimensional state space. $F_\alpha$ is a feedforward neural network or multi-layer perceptron, where $\alpha$ denotes the weight and bias parameters of the network. We can adopt a residual form [He et al.2016] for $F_\alpha$ to model the change of the state vector. $x_t$ is the $D$-dimensional image, which is generated by the $d$-dimensional hidden state vector $s_t$. $G_\beta$ is a top-down convolutional network (sometimes also called deconvolution network), where $\beta$ denotes the weight and bias parameters of this top-down network. $\epsilon_t \sim \mathrm{N}(0, \sigma^2 I_D)$ is the residual error. We let $\theta = (\alpha, \beta)$ denote all the model parameters.
Let $\xi = (\xi_t, t = 1, ..., T)$. $\xi$ consists of the latent random vectors that need to be inferred from $X$. Although $x_t$ is generated by the state vector $s_t$, the state vectors $S = (s_t, t = 1, ..., T)$ are themselves generated by $\xi$. In fact, we can write $X = H_\theta(\xi) + \epsilon$, where $H_\theta$ composes $F_\alpha$ and $G_\beta$ over time, and $\epsilon = (\epsilon_t, t = 1, ..., T)$ denotes the observation errors.
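To make the generative process concrete, the following is a minimal NumPy sketch of sampling a sequence from a model of this form. It is an illustration only: the emission network is a small multi-layer perceptron rather than a top-down convolutional network, the transition uses a residual form, and the initial state `s0`, the hidden width, and all function and parameter names are assumptions introduced for the example.

```python
import numpy as np

def init_params(d, k, D, hidden=16, seed=0):
    """Random weights for a toy transition net (alpha) and emission net (beta)."""
    rng = np.random.default_rng(seed)
    w = lambda m, n: 0.3 * rng.standard_normal((m, n))
    alpha = {"W1": w(hidden, d + k), "b1": np.zeros(hidden),
             "W2": w(d, hidden), "b2": np.zeros(d)}
    beta = {"W1": w(hidden, d), "b1": np.zeros(hidden),
            "W2": w(D, hidden), "b2": np.zeros(D)}
    return alpha, beta

def transition(alpha, s_prev, xi_t):
    # Residual form: the network models the change of the state vector.
    h = np.tanh(alpha["W1"] @ np.concatenate([s_prev, xi_t]) + alpha["b1"])
    return s_prev + alpha["W2"] @ h + alpha["b2"]

def emission(beta, s_t):
    # Toy stand-in for the top-down network: maps a d-dim state to a D-dim frame.
    h = np.tanh(beta["W1"] @ s_t + beta["b1"])
    return beta["W2"] @ h + beta["b2"]

def generate(alpha, beta, xi, s0, sigma=0.0, rng=None):
    """Roll the model forward; states are a deterministic function of the noise xi."""
    if rng is None:
        rng = np.random.default_rng()
    states, frames = [], []
    s = s0
    for xi_t in xi:
        s = transition(alpha, s, xi_t)
        x = emission(beta, s)
        if sigma > 0:
            x = x + sigma * rng.standard_normal(x.shape)  # residual error
        states.append(s)
        frames.append(x)
    return np.stack(states), np.stack(frames)
```

With `sigma=0`, calling `generate` twice on the same noise sequence returns identical frames, which reflects the fact that the noise vectors are the only latent variables.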
2.2 Learning and inference algorithm
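Let $p(\xi)$ be the prior distribution of $\xi$. Let $p_\theta(X \mid \xi) \sim \mathrm{N}(H_\theta(\xi), \sigma^2 I)$ be the conditional distribution of $X$ given $\xi$, where $I$ is the identity matrix whose dimension matches that of $X$. The marginal distribution is $p_\theta(X) = \int p(\xi) p_\theta(X \mid \xi) \, d\xi$ with the latent variable $\xi$ integrated out. We estimate the model parameter $\theta$ by the maximum likelihood method that maximizes the observed-data log-likelihood $\log p_\theta(X)$, which is analytically intractable. In contrast, the complete-data log-likelihood $\log p_\theta(\xi, X)$, where $p_\theta(\xi, X) = p(\xi) p_\theta(X \mid \xi)$, is analytically tractable. The following identity links the gradient of the observed-data log-likelihood $\log p_\theta(X)$ to the gradient of the complete-data log-likelihood $\log p_\theta(\xi, X)$:
$$\frac{\partial}{\partial \theta} \log p_\theta(X) = \frac{1}{p_\theta(X)} \frac{\partial}{\partial \theta} p_\theta(X) = \frac{1}{p_\theta(X)} \int \left[ \frac{\partial}{\partial \theta} \log p_\theta(\xi, X) \right] p_\theta(\xi, X) \, d\xi = \mathrm{E}_{p_\theta(\xi \mid X)} \left[ \frac{\partial}{\partial \theta} \log p_\theta(\xi, X) \right], \tag{3}$$
where $p_\theta(\xi \mid X) = p_\theta(\xi, X) / p_\theta(X)$ is the posterior distribution of the latent $\xi$ given the observed $X$. The above expectation can be approximated by Monte Carlo average. Specifically, we sample from the posterior distribution $p_\theta(\xi \mid X)$ using the Langevin dynamics:
$$\xi^{(\tau+1)} = \xi^{(\tau)} + \frac{\delta^2}{2} \frac{\partial}{\partial \xi} \log p_\theta(\xi^{(\tau)} \mid X) + \delta z_\tau, \tag{4}$$
where $\tau$ indexes the time step of the Langevin dynamics (not to be confused with the time step of the dynamic model, $t$), $z_\tau \sim \mathrm{N}(0, I)$ where $I$ is the identity matrix whose dimension matches that of $\xi$, and $\xi^{(\tau)} = (\xi_t^{(\tau)}, t = 1, ..., T)$ denotes all the sampled latent noise vectors at time step $\tau$. $\delta$ is the step size of the Langevin dynamics. We can correct for the finite step size by adding a Metropolis-Hastings acceptance-rejection step. After sampling $\xi \sim p_\theta(\xi \mid X)$ using the Langevin dynamics, we can update $\theta$ by stochastic gradient ascent
$$\Delta \theta \propto \frac{\partial}{\partial \theta} \log p_\theta(\xi, X), \tag{5}$$
where the stochasticity of the gradient ascent comes from the fact that we use Monte Carlo to approximate the expectation in (3). The learning algorithm iterates the following two steps. (1) Inference step: Given the current $\theta$, sample $\xi$ from $p_\theta(\xi \mid X)$ according to (4). (2) Learning step: Given $\xi$, update $\theta$ according to (5). We can use a warm start scheme for sampling in step (1). Specifically, when running the Langevin dynamics, we start from the current $\xi$, and run a finite number of steps. Then we update $\theta$ in step (2) using the sampled $\xi$. Such a stochastic gradient ascent algorithm has been analyzed by [Younes1999].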
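Since $\frac{\partial}{\partial \xi} \log p_\theta(\xi \mid X) = \frac{\partial}{\partial \xi} \log p_\theta(\xi, X)$, both steps (1) and (2) involve derivatives of
$$\log p_\theta(\xi, X) = -\frac{1}{2} \left[ \|\xi\|^2 + \frac{1}{\sigma^2} \|X - H_\theta(\xi)\|^2 \right] + \mathrm{const}, \tag{6}$$
where the constant term does not depend on $\xi$ or $\theta$. Step (1) needs to compute the derivative of $\log p_\theta(\xi, X)$ with respect to $\xi$. Step (2) needs to compute the derivative of $\log p_\theta(\xi, X)$ with respect to $\theta$. Both can be computed by back-propagation through time. Therefore the algorithm is an alternating back-propagation through time algorithm. Step (1) can be called inferential back-propagation through time. Step (2) can be called learning back-propagation through time.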
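The alternation of the two steps can be sketched on a toy model small enough to differentiate by hand. The sketch below is an illustration under explicit assumptions, not the implementation used in the experiments: the transition is `s_t = tanh(A s_{t-1} + B xi_t)` with a fixed zero initial state, the emission is linear, `x_t = C s_t`, no Metropolis-Hastings correction is applied, and the names `A`, `B`, `C`, `delta`, and `lr` are introduced here for the example. Both gradients come from the same backward pass through time.

```python
import numpy as np

def forward(params, xi):
    A, B, C = params
    T = xi.shape[0]
    s = np.zeros((T + 1, A.shape[0]))      # s[0] is a fixed zero initial state (assumption)
    for t in range(T):
        s[t + 1] = np.tanh(A @ s[t] + B @ xi[t])
    return s, s[1:] @ C.T                  # states and reconstructed frames

def log_joint(params, xi, X, sigma=1.0):
    """Complete-data log-likelihood up to an additive constant."""
    _, X_hat = forward(params, xi)
    return -0.5 * (np.sum((X - X_hat) ** 2) / sigma**2 + np.sum(xi ** 2))

def grads(params, xi, X, sigma=1.0):
    """Back-propagation through time: gradients w.r.t. the noise and the parameters."""
    A, B, C = params
    s, X_hat = forward(params, xi)
    r = (X - X_hat) / sigma**2             # derivative w.r.t. the reconstructed frames
    gA, gB = np.zeros_like(A), np.zeros_like(B)
    gC = r.T @ s[1:]
    g_xi = np.zeros_like(xi)
    carry = np.zeros(A.shape[0])           # gradient flowing back from later time steps
    for t in reversed(range(xi.shape[0])):
        g_s = C.T @ r[t] + carry
        g_a = g_s * (1.0 - s[t + 1] ** 2)  # through the tanh nonlinearity
        gA += np.outer(g_a, s[t])
        gB += np.outer(g_a, xi[t])
        g_xi[t] = B.T @ g_a - xi[t]        # likelihood term plus Gaussian prior term
        carry = A.T @ g_a
    return g_xi, (gA, gB, gC)

def abptt(X, d=3, k=2, iters=200, langevin_steps=10, delta=0.05, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    T, D = X.shape
    params = [0.1 * rng.standard_normal(shape) for shape in [(d, d), (d, k), (D, d)]]
    xi = rng.standard_normal((T, k))       # warm start: xi persists across iterations
    for _ in range(iters):
        # Inferential back-propagation through time: Langevin updates of xi.
        for _ in range(langevin_steps):
            g_xi, _ = grads(params, xi, X)
            xi = xi + 0.5 * delta**2 * g_xi + delta * rng.standard_normal(xi.shape)
        # Learning back-propagation through time: gradient ascent on the parameters.
        _, g_theta = grads(params, xi, X)
        params = [p + lr * g for p, g in zip(params, g_theta)]
    return params, xi
```

For neural-network transition and emission models, the hand-written backward pass in `grads` would normally be replaced by automatic differentiation; the structure of the outer loop stays the same.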
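To be more specific, the complete-data log-likelihood $\log p_\theta(\xi, X)$ can be written as (up to an additive constant, assuming $\sigma^2 = 1$)
$$L(\theta, \xi) = -\frac{1}{2} \sum_{t=1}^{T} \left[ \|x_t - G_\beta(s_t)\|^2 + \|\xi_t\|^2 \right]. \tag{7}$$
The derivative with respect to $\beta$ is
$$\frac{\partial L}{\partial \beta} = \sum_{t=1}^{T} (x_t - G_\beta(s_t))^\top \frac{\partial G_\beta(s_t)}{\partial \beta}. \tag{8}$$
The derivative with respect to $\alpha$ is
$$\frac{\partial L}{\partial \alpha} = \sum_{t=1}^{T} (x_t - G_\beta(s_t))^\top \frac{\partial G_\beta(s_t)}{\partial s_t} \frac{\partial s_t}{\partial \alpha}. \tag{9}$$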
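For the inference step, the corresponding derivative with respect to a single noise vector can be expanded in the same way. The following is a sketch obtained by applying the chain rule, assuming $\sigma^2 = 1$, the transition $s_t = F_\alpha(s_{t-1}, \xi_t)$, the emission mean $G_\beta(s_t)$, and $L(\theta, \xi) = -\frac{1}{2} \sum_{t} [\|x_t - G_\beta(s_t)\|^2 + \|\xi_t\|^2]$. Because $\xi_t$ influences $s_t$ and every later state, the sum runs over all $t' \geq t$.

```latex
\frac{\partial L}{\partial \xi_t}
  = -\xi_t^\top
  + \sum_{t'=t}^{T} \bigl(x_{t'} - G_\beta(s_{t'})\bigr)^\top
    \frac{\partial G_\beta(s_{t'})}{\partial s_{t'}}
    \frac{\partial s_{t'}}{\partial \xi_t},
\qquad
\frac{\partial s_{t}}{\partial \xi_t}
  = \frac{\partial F_\alpha(s_{t-1}, \xi_t)}{\partial \xi_t},
\qquad
\frac{\partial s_{t'}}{\partial \xi_t}
  = \frac{\partial F_\alpha(s_{t'-1}, \xi_{t'})}{\partial s_{t'-1}}
    \frac{\partial s_{t'-1}}{\partial \xi_t}
  \quad (t' > t).
```

The first term comes from the Gaussian prior on $\xi_t$, and the recursion in the last expression is what back-propagation through time evaluates in reverse order.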