An Adaptive Dictionary Learning Approach for Modeling Dynamical Textures

by   Xian Wei, et al.
Technische Universität München

Video representation is an important and challenging task in the computer vision community. In this paper, we assume that the image frames of a moving scene can be modeled as a linear dynamical system. We propose a sparse coding framework, named adaptive video dictionary learning (AVDL), to model a video adaptively. The developed framework captures the dynamics of a moving scene by exploiting both sparsity and the temporal correlations of consecutive video frames. The proposed method is compared with state-of-the-art video processing methods on several benchmark data sequences, which exhibit appearance changes and heavy occlusions.



1 Introduction

Temporal or dynamic textures (DTs) are image sequences that exhibit spatially repetitive patterns and certain stationarity properties in time. Such sequences are typically videos of processes such as moving water, smoke, swaying trees, moving clouds, or a flag blowing in the wind. The study and analysis of DTs is important in several applications such as video segmentation [1], video recognition [2], and DT synthesis [3].

One classical approach is to model dynamic scenes via optical flow [4]. However, such methods require a certain degree of motion smoothness and parametric motion models [1]. The non-smoothness, discontinuities, and noise inherent in rapidly varying, non-stationary DTs (e.g. fire) pose a challenge to developing optical-flow-based algorithms. Another technique, the particle filter [5], models the dynamical course of DTs as a Markov process. A common assumption in DT modeling is that each observation is correlated with an underlying latent variable, or "state", from which a transition operator between these states is derived.

Some approaches directly view each observation as a state, and then focus on transitions between the observations in the time domain. For instance, the work in [6] treats this transition as an associated probability problem, while other methods construct a spatio-temporal autoregressive (STAR) model or a position affine operator for this transition [7, 8].

Differently, feature-based models capture the intrinsic laws and underlying structures of the data by projecting the original data onto a low-dimensional feature space via feature extraction techniques, such as principal component analysis (PCA). G. Doretto et al. [2, 3] model the evolution of dynamic textured scenes as a linear dynamical system (LDS) under a Gaussian noise assumption. As a popular method for dynamic textures, LDS and its derivative algorithms have been successfully used for various dynamic texture applications [3, 2]. However, constraints are imposed on the types of motion and noise that can be modeled by an LDS. For instance, it is sensitive to input variations caused by various kinds of noise. In particular, it is vulnerable to non-Gaussian noise, such as missing data or occlusion of the dynamic scenes. Moreover, stability is also a challenging problem for LDS [9].

To tackle these challenges, the approach taken here is to explore an alternative method to model DTs by appealing to the principle of sparsity. Instead of using the principal components (PCs) as the transition "states" in LDS, sparse coefficients over a learned dictionary are imposed as the underlying "states". In this way, the dynamical process of DTs exhibits a transition course of corresponding sparse events. These sparse events can be obtained via a recent technique for linear decomposition of data, called dictionary learning [10, 11]. Formally, the sparse representation of a signal y ∈ ℝ^m can be written as

y ≈ Dφ,

where D ∈ ℝ^{m×n} is a dictionary, and φ ∈ ℝ^n is sparse, i.e. most of its entries are zero or small in magnitude. That is, the signal y can be sparsely represented using only a few elements (atoms) from the dictionary D.
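As a minimal numerical illustration of this idea (the dimensions and atom indices below are illustrative, not from the paper), a signal built from only a few unit-norm atoms of a random dictionary is exactly a sparse combination:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: signals in R^8, dictionary with 16 atoms.
m, n = 8, 16
D = rng.standard_normal((m, n))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms (columns)

# A sparse code: only 3 of the 16 entries are non-zero.
alpha = np.zeros(n)
alpha[[2, 7, 11]] = [1.5, -0.8, 0.4]

y = D @ alpha                            # the signal combines just 3 atoms
print(np.count_nonzero(alpha))           # -> 3
```

The same picture underlies the rest of the paper: the code vector, not the raw frame, becomes the object that evolves in time.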

In this work, we start with a brief review of the dynamic texture model from the viewpoint of convex optimization, and then deduce a combined regression associated with several regularizations for a joint process of "state extraction" and "state transition". We then treat the solution of this combined regression as an adaptive dictionary learning problem, which achieves two distinct yet tightly coupled tasks: efficiently reducing the dimensionality via sparse representation and robustly modeling the dynamical process. Finally, we cast this dictionary learning problem as the optimization of a smooth non-convex objective function, which is efficiently solved via a gradient descent method.

2 Adaptive Video Dictionary Learning

In this section, we start with a brief introduction to the linear dynamical systems (LDS) model and develop an adaptive dictionary learning framework for sparse coding.

2.1 Linear Dynamical Systems

Let us denote a given sequence of frames by Y := [y_1, ..., y_T] ∈ ℝ^{m×T}, where the time is indexed by t = 1, ..., T. The evolution of an LDS is often described by the following two equations

x_{t+1} = A x_t + w_t,
y_t = C x_t + v_t,    (1)

where y_t ∈ ℝ^m, x_t ∈ ℝ^k, w_t, and v_t denote the observation, its hidden state or feature, the state noise, and the observation noise, respectively. The system is described by the dynamics matrix A ∈ ℝ^{k×k} and the modeling matrix C ∈ ℝ^{m×k}. Here we are interested in estimating the system parameters A and C, together with the hidden states, given the sequence of observations Y.

The problem of learning the LDS (1) can be considered as a coupled linear regression problem [9]. Let us denote X := [x_1, ..., x_T], X_1 := [x_1, ..., x_{T−1}], and X_2 := [x_2, ..., x_T]. The system dynamics and modeling matrix are expected to be captured by solving the following minimization problem

min_{A, C, X}  (1/2) ‖Y − CX‖_F^2 + (μ/2) ‖X_2 − A X_1‖_F^2,    (2)

where μ is a small positive constant. In our approach, we assume that all observations admit a sparse representation with respect to an unknown dictionary D ∈ ℝ^{m×n}, i.e.

Y = DΦ,    (3)
where Φ := [φ_1, ..., φ_T] ∈ ℝ^{n×T} is sparse. Without loss of generality, we further assume that all columns of the dictionary have unit norm. We then define the set

S(m, n) := { D ∈ ℝ^{m×n} : ddiag(D^T D) = I_n },    (4)

where ddiag(Z) is the diagonal matrix whose entries on the diagonal are those of Z, and I_n denotes the (n × n) identity matrix. The set S(m, n) is the product of n unit spheres S^{m−1}, and is hence an n(m − 1)-dimensional smooth manifold. Finally, by adopting the common sparse coding framework to problem (2), we have the following minimization problem

min_{D ∈ S(m, n), A, Φ}  (1/2) ‖Y − DΦ‖_F^2 + (μ/2) ‖Φ_2 − AΦ_1‖_F^2 + λ ‖Φ‖_1,    (5)

where Φ_1 := [φ_1, ..., φ_{T−1}], Φ_2 := [φ_2, ..., φ_T], ‖·‖_F denotes the Frobenius norm of matrices, and ‖Φ‖_1 := Σ_{i,j} |Φ_{ij}| is the ℓ_1-norm, which measures the overall sparsity of a matrix. The parameter λ > 0 weighs the sparsity measurement against the residual errors.
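The objective combines a reconstruction term, a transition term, and a sparsity term. A minimal numeric evaluation of these three terms (all sizes, weights, and data below are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, T = 6, 9, 12                       # illustrative dimensions
Y = rng.standard_normal((m, T))          # "observed" frames (random stand-in)
D = rng.standard_normal((m, n))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
A = 0.1 * rng.standard_normal((n, n))    # transition matrix
Phi = rng.standard_normal((n, T))
Phi[np.abs(Phi) < 1.0] = 0.0             # make the coefficient matrix sparse

mu, lam = 0.5, 0.2                       # illustrative weights
Phi1, Phi2 = Phi[:, :-1], Phi[:, 1:]

cost = (0.5 * np.linalg.norm(Y - D @ Phi, 'fro')**2            # reconstruction
        + 0.5 * mu * np.linalg.norm(Phi2 - A @ Phi1, 'fro')**2  # dynamics
        + lam * np.abs(Phi).sum())                              # overall sparsity
print(np.isfinite(cost) and cost > 0)    # -> True
```

Note that the first-order-difference pair (Phi1, Phi2) is exactly the column-shifted split used in the transition term.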

2.2 A Dictionary Learning Model for Dynamical Scene

Solving the minimization problem as stated in Eq. (5) is a very challenging task. In this work, we employ an idea similar to subspace identification methods [9], which treat the state Φ as a function of D. Here, we confine ourselves to the sparse solution of an elastic-net problem, which is proposed in [12], as

φ_t := argmin_{φ ∈ ℝ^n}  (1/2) ‖y_t − Dφ‖_2^2 + λ ‖φ‖_1 + (κ/2) ‖φ‖_2^2,    (6)

where λ and κ are regularization parameters, which play an important role in ensuring stability and uniqueness of the solution. Let us define the set of indices of the non-zero entries of the solution φ_t as

Λ_t := { i ∈ {1, ..., n} : φ_t(i) ≠ 0 }.    (7)
Then the solution has a closed-form expression as

φ_Λ = (D_Λ^T D_Λ + κ I)^{−1} (D_Λ^T y_t − λ s_Λ),    (8)

where s_Λ carries the signs of φ_Λ, and D_Λ is the subset of atoms (columns) of D whose indices fall into the support Λ. Furthermore, it is known that the solution as given in (8) is a locally twice differentiable function of D. By an abuse of notation, we define

φ_t(D) := argmin_{φ ∈ ℝ^n}  (1/2) ‖y_t − Dφ‖_2^2 + λ ‖φ‖_1 + (κ/2) ‖φ‖_2^2,    (9)

regarded as a function of the dictionary D.
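The closed form on the support can be checked against a generic elastic-net solver. The sketch below (illustrative sizes and regularization weights; an ISTA proximal-gradient loop stands in for a library solver) recovers the support numerically and verifies that the solution matches the closed-form expression (8):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 10
D = rng.standard_normal((m, n))
D /= np.linalg.norm(D, axis=0)
y = rng.standard_normal(m)
lam, kap = 0.1, 0.1                      # illustrative regularization weights

# Reference solution by proximal gradient (ISTA) on the elastic net.
L = np.linalg.norm(D.T @ D, 2) + kap     # Lipschitz constant of the smooth part
phi = np.zeros(n)
for _ in range(5000):
    grad = D.T @ (D @ phi - y) + kap * phi
    z = phi - grad / L
    phi = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold

# Closed form restricted to the recovered support, as in Eq. (8).
supp = np.flatnonzero(np.abs(phi) > 1e-10)
Ds = D[:, supp]
s = np.sign(phi[supp])
phi_cf = np.linalg.solve(Ds.T @ Ds + kap * np.eye(len(supp)),
                         Ds.T @ y - lam * s)
print(np.allclose(phi[supp], phi_cf, atol=1e-5))
```

The agreement reflects the optimality condition on the support: D_Λ^T(Dφ − y) + κφ_Λ + λs_Λ = 0.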
In a similar way, Φ(D) := [φ_1(D), ..., φ_T(D)] is defined. Thus, the cost function reads as

g(D, A) := (1/2) ‖Y − DΦ(D)‖_F^2 + (μ/2) ‖Φ_2(D) − AΦ_1(D)‖_F^2.    (10)
It is known that an LDS with the dynamic matrix A is said to be stable if the magnitude of the largest eigenvalue of A is bounded by 1 [9]. Let σ be the largest eigenvalue of A^T A; then the spectral radius of A satisfies |λ_max(A)|^2 ≤ σ ≤ ‖A‖_F^2. Thus, we enforce a small ‖A‖_F via imposing a penalty on (10), and then end up with the cost function

f(D, A) := g(D, A) + (ν/2) ‖A‖_F^2,    (11)

where ν > 0 weighs the stability penalty.
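The chain of bounds that justifies penalizing the Frobenius norm can be checked numerically: the spectral radius never exceeds the largest singular value, which never exceeds ‖A‖_F (a random matrix below as an illustrative example):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))          # an arbitrary dynamics matrix

rho = np.max(np.abs(np.linalg.eigvals(A)))            # spectral radius |lambda_max(A)|
sigma = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))  # largest singular value
fro = np.linalg.norm(A, 'fro')                        # Frobenius norm

print(rho <= sigma + 1e-12 and sigma <= fro + 1e-12)  # -> True
```

Hence driving ‖A‖_F down is a (conservative) way of driving the spectral radius, and therefore the LDS, toward stability.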
2.3 Development of the Algorithm

In this section, we first derive a gradient descent algorithm to minimize (11) and then discuss some details of the choice of parameters in the final implementation.

We start with the computation of the first derivative of the sparse solution of the elastic-net problem as given in (8). Given the tangent space of S(m, n) at D as

T_D S(m, n) = { Ξ ∈ ℝ^{m×n} : ddiag(D^T Ξ) = 0 },    (12)

the orthogonal projection of a matrix Z ∈ ℝ^{m×n} onto the tangent space with respect to the inner product ⟨Z_1, Z_2⟩ := tr(Z_1^T Z_2) is given by

Π_{T_D S}(Z) = Z − D ddiag(D^T Z).    (13)
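A quick sanity check of the tangent-space projection for the product-of-spheres constraint (illustrative sizes; ddiag realized via NumPy's `diag`): after projecting, each column of the result is orthogonal to the corresponding unit-norm dictionary column, i.e. the projected direction is tangent.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 8, 5
D = rng.standard_normal((m, n))
D /= np.linalg.norm(D, axis=0)           # a point on the product of unit spheres

Z = rng.standard_normal((m, n))
# Projection: subtract from each column of Z its component along the
# corresponding column of D (ddiag keeps only the diagonal of D^T Z).
P = Z - D @ np.diag(np.diag(D.T @ Z))

print(np.allclose(np.diag(D.T @ P), 0.0))  # each column of P is tangent -> True
```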
Let us denote the first derivative of φ_t(D) at D in a direction Ξ ∈ T_D S(m, n) by Dφ_t(D)[Ξ]. On the support Λ, differentiating (8) yields

Dφ_t(D)[Ξ]|_Λ = (D_Λ^T D_Λ + κ I)^{−1} ( Ξ_Λ^T (y_t − D_Λ φ_Λ) − D_Λ^T Ξ_Λ φ_Λ ),    (14)

and the remaining entries are zero. By the product structure of S(m, n), the Riemannian gradient of the function f is

grad f(D) = Π_{T_D S}( ∇_D f(D, A) ),    (15)

where ∇_D f denotes the Euclidean gradient of f with respect to D.
Here, the Euclidean gradient of f with respect to D is characterized via the chain rule: for all Ξ,

⟨∇_D f(D, A), Ξ⟩ = −⟨Y − DΦ(D), ΞΦ(D) + D DΦ(D)[Ξ]⟩ + μ ⟨Φ_2(D) − AΦ_1(D), DΦ_2(D)[Ξ] − A DΦ_1(D)[Ξ]⟩,    (16)

with e_i being the i-th standard basis vector of ℝ^n, used to embed the support-restricted derivatives (14) back into ℝ^n, and DΦ(D)[Ξ] := [Dφ_1(D)[Ξ], ..., Dφ_T(D)[Ξ]]. Using the shorthand notation Φ := Φ(D), Φ_1 := [φ_1(D), ..., φ_{T−1}(D)], and Φ_2 := [φ_2(D), ..., φ_T(D)], the Euclidean gradient of f with respect to A is

∇_A f(D, A) = μ (A Φ_1 − Φ_2) Φ_1^T + ν A.    (17)
For a gradient search iteration on manifolds, we employ the following smooth curve on S(m, n) through D in direction Ξ ∈ T_D S(m, n),

Γ(t) := (D + tΞ) ddiag( (D + tΞ)^T (D + tΞ) )^{−1/2},    (18)

with Γ(0) = D. It essentially normalizes all columns of D + tΞ. For a detailed overview of optimization on matrix manifolds, refer to [13].
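The column-normalizing curve is easy to realize and check numerically (illustrative sizes; the tangent direction is produced with the projection from above): the curve passes through D at t = 0 and stays on the product of unit spheres for every t.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 5
D = rng.standard_normal((m, n))
D /= np.linalg.norm(D, axis=0)           # point on the manifold
Xi = rng.standard_normal((m, n))
Xi -= D @ np.diag(np.diag(D.T @ Xi))     # a tangent direction at D

def gamma(t):
    """Curve through D in direction Xi: rescale every column to unit norm."""
    M = D + t * Xi
    return M / np.linalg.norm(M, axis=0)

print(np.allclose(gamma(0.0), D))                             # -> True
print(np.allclose(np.linalg.norm(gamma(0.3), axis=0), 1.0))   # -> True
```

A gradient step along this curve is thus an ordinary Euclidean step followed by per-column renormalization.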

1:  Input: training data Y.
2:  Initialize the parameters λ, κ, μ, ν, the initial dictionary D, and the initial transition matrix A.
3:  for k = 1, 2, ... do
4:     Sparse coding stage:
       Use the Lasso (elastic-net) algorithm to compute Φ via (6).
       Compute the active set Λ_t for each φ_t.
5:     Compute the gradients of f according to (16) and (17).
6:     Update the parameters D (along the curve (18)) and A.
7:  end for
8:  return D and A
Algorithm 1 Adaptive Video Dictionary Learning

Until now, we have computed the gradient of f as defined in (11) with respect to its two arguments D and A. An iterative scheme (such as the gradient descent method or the conjugate gradient method) can be used to find the optimal D and A, using the gradient expressions above. The procedure displayed in Algorithm 1 is the version of AVDL based on gradient descent. The learning rate can be computed via the well-known backtracking line search method, similar to [11]. Here, considering the high coherence among the temporal frames, we prefer a non-redundant dictionary, i.e. the number of atoms in the dictionary D does not exceed the signal dimension. For the parameters of the elastic net, we put an emphasis on sparse solutions and choose a small κ, as proposed in [12].
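For concreteness, the following is a much-simplified numerical sketch of the alternating scheme in Algorithm 1. All sizes, weights, and step rules are illustrative assumptions: an ISTA loop stands in for the Lasso stage, and the dictionary step treats Φ as fixed (ignoring the dependence Φ(D) that the full AVDL gradient accounts for) followed by column renormalization.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, T = 10, 15, 40
Y = rng.standard_normal((m, T))              # stand-in "video" data
lam, kap, mu, nu, step = 0.1, 0.1, 0.5, 0.1, 0.05

D = rng.standard_normal((m, n)); D /= np.linalg.norm(D, axis=0)
A = np.zeros((n, n))

def sparse_code(D, Y):
    """Elastic-net coding of all frames by ISTA (stand-in for the Lasso stage)."""
    L = np.linalg.norm(D.T @ D, 2) + kap
    Phi = np.zeros((n, T))
    for _ in range(300):
        G = D.T @ (D @ Phi - Y) + kap * Phi
        Z = Phi - G / L
        Phi = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)
    return Phi

def cost(D, A, Phi):
    return (0.5 * np.linalg.norm(Y - D @ Phi)**2
            + 0.5 * mu * np.linalg.norm(Phi[:, 1:] - A @ Phi[:, :-1])**2
            + lam * np.abs(Phi).sum() + 0.5 * nu * np.linalg.norm(A)**2)

Phi = sparse_code(D, Y)
c0 = cost(D, A, Phi)                         # cost at the random initialization
for _ in range(20):
    Phi = sparse_code(D, Y)
    # Dictionary step: Euclidean gradient with Phi held fixed, then renormalize.
    GD = (D @ Phi - Y) @ Phi.T
    D = D - step * GD / max(1.0, np.linalg.norm(GD))
    D /= np.linalg.norm(D, axis=0)
    # Transition step: exact gradient of the quadratic terms in A.
    GA = mu * (A @ Phi[:, :-1] - Phi[:, 1:]) @ Phi[:, :-1].T + nu * A
    A = A - step * GA / max(1.0, np.linalg.norm(GA))

Phi = sparse_code(D, Y)
c1 = cost(D, A, Phi)
print(c1 < c0)                               # learning reduces the objective
```

This sketch omits the backtracking line search and the exact Riemannian gradient; it only illustrates the alternation between sparse coding and the two gradient updates.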

Instance                 LDS (number of PCs)        AVDL (number of loops)
                         64      128     256        1        50      100     200     400
Compression rate (%)     6.25    12.50   25.00      1.02     3.29    3.41    3.50    3.55
Largest eigenvalue σ     0.9802  0.9833  0.9849     1.78     1.06    0.9992  0.9994  0.9994
Error                    60.29   71.27              171.99   75.52   61.96   46.18
Table 1: Synthesizing results on the sequence of a burning candle.

3 Numerical Experiments

(a) Corrupted original sequence.
(b) Reconstructed sequence.
(c) Synthesized video using LDS and AVDL on DTs with Gaussian noise.
(d) Synthesized video using LDS and AVDL on DTs with missing data.
Figure 1: Reconstruction and synthesizing on the candle scene. (a) and (b) show a frame of the data corrupted by Gaussian noise and the data reconstructed using AVDL, respectively. (c) The top row is the sequence synthesized using LDS (128 PCs), and the bottom row is the sequence synthesized using AVDL. (d) The top row is the sequence with missing data, the middle row is the sequence synthesized using LDS, and the bottom row is the sequence synthesized using AVDL.

We carry out a few experiments on natural image sequences and demonstrate the practicality of the proposed algorithm. Our test dataset comprises videos from DynTex++ [14] and data from internet sources (for instance, YouTube). Firstly, we show the performance on reconstruction and synthesizing with a grayscale video of a burning candle, corrupted by Gaussian noise or occlusion. This video has 1024 frames; see Figure 1. After the acquisition of the dictionary D and the transition matrix A, the synthesized data can be generated easily by iterating the state transition with A and mapping the states back through D, or, more precisely, by re-solving the convex elastic-net formulation (6) at each step.

Table 1 shows the synthesizing performance on the burning candle with Gaussian noise. The error pairs measure the reconstruction residual ‖Y − DΦ‖_F^2 and the transition residual ‖Φ_2 − AΦ_1‖_F^2, and the largest eigenvalue of A^T A is denoted by σ. The compression rate for AVDL is the ratio of the number of non-zero coefficients of Φ to the size of Y, and for LDS the ratio of the number of PCs to the data dimension. Table 1 shows that AVDL obtains a stable dynamic matrix A, a smaller compression rate, and a smaller error of the cost function (5) as the number of main loops in Algorithm 1 increases.

Figure 1 gives a visual comparison between LDS and AVDL. AVDL performs well on denoising against corruption by Gaussian noise. In the case of occlusion in Figure 1 (d), 50 random frames of the 1024-frame burning candle video are corrupted by a rectangle. Both synthesized sequences have length 1024 and are generated from the first frame of the burning candle. A noticeable fraction of the frames synthesized by LDS remains corrupted by this rectangle, whereas far fewer AVDL frames are affected.

Occlusion rate (%)    0       5       15      30
LDS-NN (128 PCs)      69.72   45.00   25.14   14.17
AVDL-SRC              70.28   64.72   44.44   22.36

Table 2: DT recognition rates (%) for videos with occlusion.

The second experiment concerns scene classification on DynTex++, which contains DTs from 36 classes. Each class has 100 subsequences of 50 frames each. 20 videos are randomly chosen from each class, so 720 videos in total are used in our experiments. Classification for LDS is performed using the Martin distance with a nearest-neighbor classifier on its parameter pair (A, C) [2]. Another classifier is AVDL associated with the sparse representation-based classifier (SRC) [15, 16], in which the class of a test sequence is determined by the smallest reconstruction and transition errors. Table 2 provides the recognition results with increasing occlusion rates for the test data. Compared to LDS with the nearest-neighbor classifier (LDS-NN), Table 2 shows that the proposed AVDL with SRC (AVDL-SRC) performs better as the test videos are corrupted by increasing occlusion.
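The smallest-reconstruction-error decision rule behind SRC can be sketched in a few lines. Everything below is an illustrative toy (two hypothetical class dictionaries, a least-squares code per class instead of a full sparse solver, and no transition-error term):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 12, 8

# Two hypothetical class dictionaries with unit-norm atoms.
dicts = []
for _ in range(2):
    Dc = rng.standard_normal((m, n))
    dicts.append(Dc / np.linalg.norm(Dc, axis=0))

# A test signal generated from class 1's dictionary.
alpha = np.zeros(n); alpha[[1, 4]] = [1.0, -0.7]
y = dicts[1] @ alpha

# Decide by the smallest reconstruction error over the class dictionaries.
errs = []
for Dc in dicts:
    code, *_ = np.linalg.lstsq(Dc, y, rcond=None)   # per-class code
    errs.append(np.linalg.norm(y - Dc @ code))
print(int(np.argmin(errs)))  # -> 1
```

In AVDL-SRC the per-class code would come from the sparse coding stage, and the transition error with respect to the class transition matrix A would be added to the decision criterion.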

4 Conclusions

This paper proposes an alternative method, called AVDL, to model the dynamic process of DTs. In AVDL, sparse events over a learned dictionary are imposed as transition states. The proposed method shows robust performance for synthesizing, reconstruction, and recognition on DTs corrupted by Gaussian noise. In particular, AVDL is more powerful in the case of test data with non-Gaussian noise, such as occlusion. One possible future extension is to learn a dictionary for large-scale DT sequences based on AVDL.


  • [1] Antoni B Chan and Nuno Vasconcelos, “Layered dynamic textures,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 10, pp. 1862–1879, 2009.
  • [2] Payam Saisan, Gianfranco Doretto, Ying Nian Wu, and Stefano Soatto, “Dynamic texture recognition,” in Computer Vision and Pattern Recognition, IEEE Computer Society Conference on. IEEE, 2001, vol. 2, pp. II–58.
  • [3] Gianfranco Doretto, Alessandro Chiuso, Ying Nian Wu, and Stefano Soatto, “Dynamic textures,” International Journal of Computer Vision, vol. 51, no. 2, pp. 91–109, 2003.
  • [4] Berthold KP Horn and Brian G Schunck, “Determining optical flow,” Artificial intelligence, vol. 17, no. 1, pp. 185–203, 1981.
  • [5] Petar M Djuric, Jayesh H Kotecha, Jianqui Zhang, Yufei Huang, Tadesse Ghirmai, Mónica F Bugallo, and Joaquin Miguez, “Particle filtering,” Signal Processing Magazine, IEEE, vol. 20, no. 5, pp. 19–38, 2003.
  • [6] Arno Schödl, Richard Szeliski, David H Salesin, and Irfan Essa, “Video textures,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 489–498.
  • [7] Martin Szummer and Rosalind W Picard, “Temporal texture modeling,” in International Conference on Image Processing. IEEE, 1996, vol. 3, pp. 823–826.
  • [8] Vivek Kwatra, Arno Schödl, Irfan Essa, Greg Turk, and Aaron Bobick, “Graphcut textures: image and video synthesis using graph cuts,” in Graphics (TOG), ACM Transactions on. ACM, 2003, vol. 22, pp. 277–286.
  • [9] Byron Boots, Geoffrey J Gordon, and Sajid M Siddiqi, “A constraint generation approach to learning stable linear dynamical systems,” in Advances in Neural Information Processing Systems, 2007, pp. 1329–1336.
  • [10] Michael Elad and Michal Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” Image Processing, IEEE Transactions on, vol. 15, no. 12, pp. 3736–3745, 2006.
  • [11] Simon Hawe, Matthias Seibert, and Martin Kleinsteuber, “Separable dictionary learning,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, June 2013, pp. 438–445.
  • [12] Hui Zou and Trevor Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
  • [13] P-A Absil, Robert Mahony, and Rodolphe Sepulchre, Optimization algorithms on matrix manifolds, Princeton University Press, 2009.
  • [14] Bernard Ghanem and Narendra Ahuja, “Maximum margin distance learning for dynamic texture recognition,” in European Conference on Computer Vision, pp. 223–236. Springer, 2010.
  • [15] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 2, pp. 210–227, 2009.
  • [16] Bernard Ghanem and Narendra Ahuja, “Sparse coding of linear dynamical systems with an application to dynamic texture recognition,” in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 987–990.